## On Improving Dependability of Analog and Mixed-Signal SoCs: A System-Level Approach

Muhammad Aamir Khan

#### Members of the dissertation committee:

| Title | Name | Affiliation |
|---------------|------------------|-----------------------------------------------|
| Prof. dr. ir. | G.J.M. Smit | University of Twente (promoter) |
| Dr. ir. | H.G. Kerkhoff | University of Twente (co-promoter) |
| Prof. dr. ir. | A. Pras | University of Twente |
| Prof. dr. ir. | A.J.M. van Tuijl | University of Twente |
| Prof. dr. | J. Figueras | Universitat Politècnica de Catalunya (Spain) |
| Prof. dr. | A. Richardson | Lancaster University (United Kingdom) |
| Dr. ir. | S. Hamdioui | Delft University of Technology |
| Prof. dr. | P. Apers | University of Twente (chairman and secretary) |

This work has been carried out as part of the Catrene project "TOETS" [CT302] and supported by the Netherlands Enterprise Agency.

CTIT Ph.D. Thesis Series No. 14-328

Center for Telematics and Information Technology, University of Twente, P.O. Box 217, NL-7500 AE Enschede, The Netherlands.

Copyright © 2014 by Muhammad Aamir Khan, Enschede, The Netherlands. Cover designed by Muhammad Aamir Khan.

All rights reserved. No part of this book may be reproduced or transmitted, in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without prior written permission of the author.

Typeset with Microsoft Word 2010. This thesis was printed by Gilderprint Drukkerijen, the Netherlands.

ISBN 978-90-365-3777-3

ISSN 1381-3617 (CTIT Ph.D. Thesis Series No. 14-328)

DOI 10.3990/1.9789036537773

## ON IMPROVING DEPENDABILITY OF ANALOG AND MIXED-SIGNAL SOCS: A SYSTEM-LEVEL APPROACH

#### DISSERTATION

to obtain the degree of doctor at the University of Twente, on the authority of the rector magnificus, prof. dr. H.
Brinksma, on account of the decision of the graduation committee, to be publicly defended on Friday, 7<sup>th</sup> of November 2014 at 16.45 by

Muhammad Aamir Khan

born on 25<sup>th</sup> of June 1979, in Mianwali, Pakistan

#### This dissertation is approved by:

Prof. dr. ir. G.J.M. Smit, University of Twente (promoter)

Dr. ir. H.G. Kerkhoff, University of Twente (co-promoter)

#### **ABSTRACT**

Dependability of electronic systems, being an indispensable part of our civilian, industrial and military applications, has become increasingly important as a result of continuous technology scaling. The dependability of, or human reliance on, these electronic systems has decreased because new technologies are far less mature than older technologies. The electrical characteristics of the transistors and the wires vary statistically in a spatial and temporal manner, directly translating into design uncertainty during fabrication and even during operational life. The combined impact of manufacturing uncertainty (e.g. process variability) and temporal degradation (aging) results in time-dependent variability, which can impair the functionality and dependability of electronic systems. Unfortunately, traditional worst-case design slacks or margins are no longer sufficient to capture the time-dependent system variability, especially in new technology nodes, and would result in over-pessimistic implementations with significant penalties in terms of area, delay and energy. As a result, time-dependent uncertainties become a great threat to the design of complex system-on-chip (SoC) implementations and their dependability. This threat is especially serious in safety- or mission-critical systems, because a dependability failure of these systems may result in enormous cost damage or even loss of human lives. Therefore, maintaining or achieving high dependability is the most important requirement for safety- or mission-critical systems.
Analog and mixed-signal front/back ends are an important part of most critical systems, especially safety-critical (e.g. automotive, medical) and mission-critical (e.g. military, space) systems, yet they have received relatively little attention with regard to dependability. The dependability of these analog and mixed-signal front/back ends is essential in order to have a dependable interface between the real world and the digital world. This is the main goal of the current research, in which new system-level strategies have been explored and investigated in order to enhance the dependability of analog and mixed-signal front ends, especially during their operational life.

A system-level hardware platform has been proposed as a potential solution to enhance the dependability of analog and mixed-signal front ends. The idea is to diagnose the performance of analog and mixed-signal front ends at regular intervals and, in case of performance deviations from the designed specifications, to take the necessary repair actions. The proposed hardware platform is based on the concept of digitally-assisted redundant/spare hardware units, where a separate hardware block is responsible for taking performance measurements and the corresponding repair actions, either via built-in digital tuning capabilities or by using a switch matrix to replace the faulty hardware units with fault-free spare hardware units. The theoretical/mathematical dependability evaluation of the proposed hardware platform shows dependability improvements at the cost of extra hardware. However, a new hardware platform is proposed that can achieve similar dependability with reduced area overhead. Area, speed and power requirements, along with different values of dependability, are further optimized using the concept of a library of dependable IPs proposed in this research work.
The idea is to use the design-stage simulation results to construct a library of dependable IPs, containing a number of IPs that have the same functionality but different values of dependability, along with their speed, area, and power overheads. This library can then be used to select the best combination of IPs, resulting in an optimization between the required dependability and the associated penalties in terms of area, speed and power. This library of dependable IPs, coupled with the new hardware platform, is used to achieve higher dependability in analog and mixed-signal front ends.

The hardware platform proposed above resolves the time-dependent degradation issues in analog and mixed-signal front ends during their operational life. However, it has been observed that the initial variations caused by fabrication-related process variations have a significant effect on the degradation behavior of similar systems: some systems will degrade more quickly than other, similar systems. This issue has been resolved by introducing a database of system specifications and runtime-logged measurements of the system performance parameters, in order to know the degradation behavior of a system exactly and to know more precisely at what point in time the repair actions have to be taken to avoid dependability failures. Furthermore, in order to avoid potential circuit overloading effects when interacting directly with the internal nodes of the system for performance measurements, a novel indirect technique to estimate the system performance during its operational life is also presented. The indirect technique, in combination with the database of system specifications and logged values of system performance measurements, is further used to effectively achieve higher dependability in analog and mixed-signal front ends.
The time-dependent and process-induced initial-value dependent degradation effects are further related to the working stress conditions and the corresponding duration of stress time. Among them are the working stress temperature and working stress voltage. These stressors can be differentiated as short-term and long-term based on the duration of time these stressors are applied. This will further lead to short-term variations (or temporary) and long-term variations (could be permanent). These short-term and long-term variations need to be differentiated in order to efficiently enhance the dependability of analog and mixed-signal front ends. This concept has been incorporated in the proposed strategy where a continuous monitoring of the working stress temperature and working stress voltage has been performed in order to take the necessary actions in maintaining these stressors within their specifications and for selecting the appropriate repair strategies in order to further enhance the dependability of the analog and mixed-signal front ends. As an example of a relatively complex system and an essential part of analog and mixed-signal front ends, the charge-redistribution successive approximation register (SAR) ADC has been considered to analyze the time-dependent degradation issues in its static and dynamic performance parameters. Usually, transistor-level aging simulations are very time consuming for these types of systems. Therefore, behavioral models, that have been frequently used to analyze the performance of electronic systems, are used to simulate the degradation effects in a SAR ADC. The SAR ADC has been subdivided into smaller sub-blocks and the degradation information of each sub-block has been incorporated in their respective behavioral models to simulate the degradation effects in the complete SAR ADC. 
A flexible simulation setup has been constructed in the LabVIEW environment, in which different important parameters of the SAR ADC can be selected to see the corresponding degradation effects in its static and dynamic performance parameters. This degradation information has been further used in proposing system-level strategies to enhance the dependability of SAR ADCs during their operational life.

#### **ACKNOWLEDGEMENTS**

This dissertation is not only a result of the continuous hard work, enthusiasm, perseverance, and consistent efforts during the past four years, but also of the encouragement, cooperation, and support from a number of people. Therefore, I would like to take this opportunity to pay my sincere thanks to all of them.

First of all, I would like to express my special gratitude to my supervisor Dr. ir. Hans G. Kerkhoff, who has remained a tremendous mentor throughout my research work. His continuous scientific nourishment and encouraging attitude gave me the confidence to grow as an independent scientific researcher. He taught me the art of critically analyzing scientific issues, the scientific skills to think of novel solutions, and the effective methodology of technical writing and presentation. His cooperation, open-minded attitude, long scientific discussions, and open-door policy have helped me substantially in completing the research and obtaining meaningful results. His advice on both my research and my career has been priceless. His relentless determination and dedication to scientific research, sincerity and devotion towards his scientific duties, and the thrill to attain perfection will always remain a source of inspiration and guidance to me.

I would also like to pay my special thanks to my promoter Prof. dr. ir. G.J.M. Smit for his continuous support and encouragement, and for devoting his precious time to reviewing my thesis.
Moreover, I would like to thank all the members of the dissertation committee for their valuable comments and participation in the public defense of this dissertation.

Furthermore, I would like to thank my research colleague Jinbo Wan for his valuable support during all scientific discussions and for providing me with the Cadence simulation environment to run aging simulations. I would also like to thank all of my former and current office colleagues: Vincent Kerzerho, Alireza Rohani, Ahmed Ibrahim, Andreina Zambrano, Yong Zhao, Xiao Zhang, Xiaoqin Sheng, Philip Hölzenspies, Berend Dekens, and Koen Blom for their valuable suggestions and help, and for making my stay enjoyable. Special thanks also go to Bert Helthuis for providing all sorts of technical assistance and to Marlous Weghorst, Nicole Baveld, and Thelma Nordholt for their supporting role in all administrative matters during my stay.

I would also like to thank all of my Pakistani friends and their families, with whom we arranged different cultural, social, and sport events. It has been a source of tremendous pleasure and enjoyment to spend time with them.

In the end, I would like to pay my deepest thanks to my family. My beloved mother is not in this world anymore, but her love, affection and encouragement will always remain with me throughout my life. I cannot express my gratitude for her in words; she always prayed for my success, and her lovely memories will remain my greatest strength. My cherished father has always guided and encouraged me in every aspect of life, and his constant inspiration and support kept me focused and motivated throughout my studies. It is only because of his prayers that I have sustained thus far. I also sincerely acknowledge my brother, whose smiling and advocating attitude kept me devoted to my research work. I am especially thankful to my wife for her patience and caring attitude during my studies. I extend my wholehearted thanks for her love, respect and support.
Finally, my greatest regards to the Almighty Allah for bestowing upon me the courage to face the complexities of life and to complete this research successfully.

Muhammad Aamir Khan

Enschede, November 2014

## **CONTENTS**

| Section | Title |
|----------|-------|
| | Abstract |
| | Acknowledgements |
| | Contents |
| 1 | Introduction |
| 1.1 | Dependability and its Importance |
| 1.2 | Dependability Issues in Advanced Technology Nodes |
| 1.3 | Dependability Issues in Analog and Mixed-Signal Systems |
| 1.4 | Traditional Dependability Improvement Pitfalls |
| 1.5 | Problem Statement and Research Questions |
| 1.6 | Presented Approach |
| 1.7 | Thesis Organization |
| 1.8 | References |
| 2 | Background and Related Work |
| 2.1 | Introduction |
| 2.2 | Dependability |
| 2.2.1 | Dependability Attributes |
| 2.2.2 | Dependability Impairments |
| 2.2.3 | Dependability Means |
| 2.3 | Selected Dependability Attributes |
| 2.4 | Dependability Theory |
| 2.4.1 | Reliability Theory |
| 2.4.2 | Maintainability Theory |
| 2.4.3 | Availability Theory |
| 2.5 | Degradation Mechanisms and System Dependability |
| 2.5.1 | Bias Temperature Instability |
| 2.5.2 | Hot Carrier Injection |
| 2.5.3 | Time-Dependent Dielectric Breakdown |
| 2.5.4 | Electro-migration |
| 2.6 | Analog and Mixed Signal Dependability Improvements |
| 2.6.1 | Brief History |
| 2.6.2 | Recent Practices |
| 2.6.2.1 | Device-Level Efforts |
| 2.6.2.2 | Design-Level Efforts |
| 2.6.2.3 | Simulation-Tool Level Efforts |
| | Degradation Analysis Examples |
| | Degradation Mitigation Examples |
| | Considerations in the Current Research |
| | Working Principle |
| | Modelling Analog and Mixed-Signal Front-End |
| | A. Modelling a Temperature Sensor |
| | B. Modelling an Operational Amplifier |
| | A. Modelling the Analog-to-Digital Converter |
| 3.10.2 | Simulation Setup |
| 3.10.2.1 | Simulation Results |
| 3.11 | Conclusions |
| 3.12 | References |
| 4 | Runtime Reliability Estimations and System Dependability |
| 4.1 | Introduction |
| 4.2 | Hierarchical Flow of System Specifications |
| 4.3 | Variations in System-Level Parameters |
| 4.3.1 | Parameter Variations vs Temporal Degradations |
| 4.4 | Runtime Reliability Requirements |
| 4.5 | Critical Performance Parameters |
| 4.6 | Quantitative Runtime Reliability Estimation |
| 4.7 | Proposed Dependability Workflow |
| 4.7.1 | Working Principle |
| 4.7.2 | Dependability Improvements |
| 4.8 | Simulations and Results |
| 4.8.1 | Simulation Setup |
| 4.8.2 | Simulation of Degradation Behaviours |
| 4.8.3 | The Simulator GUI |
| 4.8.4 | Randomly Selected Values |
| 4.8.5 | The Simulation Results |
| 4.8.6 | Possible Overhead and Overall Performance |
| 4.9 | Indirect Reliability Estimation |
| 4.9.1 | Design-Stage Degradation Rate Extraction |
| 4.9.2 | Indirect Reliability Estimation Approach |
| 4.9.2.1 | Calculations for an Example Target System |
| 4.9.2.2 | Simulation Setup |
| 4.9.2.3 | Simulation Results |
| 4.10 | Conclusions |
| 4.11 | References |
| 5 | Differentiating Between Short-Term and Long-Term Dependability Issues |
| 5.1 | Introduction |
| 5.2 | Supply-Voltage and Temperature Variations |
| 5.2.1 | Supply-Voltage and Temperature Variations in Digital Systems |
| 5.2.2 | Supply-Voltage and Temperature Variations in Analog Systems |
| 5.2.3 | The Role of Supply-Voltage and Temperature Variation |
| 5.3 | The Importance of Separating NBTI and Supply-Voltage and Temperature Variations |
| 5.4 | Enhancing the System Dependability |
| 5.5 | Dependable Hardware Architecture |
| 5.5.1 | Principle of Workflow |
| 5.5.2 | Pros and Cons of Proposed Approach |
| 5.6 | Simulations and Results |
| 5.6.1 | Target System |
| 5.6.2 | The Simulation Environment |
| 5.6.3 | Simulation Results of the Target System |
| 5.6.4 | Comparison of the Simulation Results |
| 5.7 | Conclusions |
| 5.8 | References |
| 6 | Performance Degradation Analysis and Dependability Enhancement of SAR ADCs |
| 6.1 | Introduction |
| 6.2 | The Charge Redistribution SAR ADC |
| 6.2.1 | The Working Principle of the ADC |
| 6.2.2 | Modelling Degradation Effects in the SAR ADC |
| 6.2.2.1 | Modelling the Buffer and Comparator Degradation Effects |
| 6.2.2.2 | Modelling the DAC Capacitor-Array Degradation Effects |
| 6.3 | SAR ADC Performance Analysis |
| 6.4 | Simulation Setup |
| 6.5 | Simulation Results |
| 6.5.1 | Static Performance Parameter Degradation Results |
| 6.5.1.1 | The SAR ADC Output Offset Voltage Degradation |
| 6.5.1.2 | The SAR ADC GAIN Degradation |
| 6.5.1.3 | The SAR ADC DNLE and INLE Degradation |
| 6.5.2 | Dynamic Performance Parameter Results |
| 6.5.2.1 | SAR ADC SINAD, THD and ENOB Degradation |
| 6.5.3 | Summary of Simulation Results |
| 6.6 | Potential Critical Performance Parameters |
| 6.7 | Proposed Dependability Enhancement Strategies |
| 6.7.1 | Monitoring Mechanisms |
| 6.7.2 | Controlling Mechanisms |
| 6.7.2.1 | Controlling the Buffer and Comparator Offset |
| | B. The Offset Cancellation Technique |
| | C. Digital Tuning Techniques for Offset |
| 6.7.2.2 | Controlling the DAC Capacitor Array Values |
| 6.7.3 | Dependability Enhancement Strategy |
| 6.7.3.1 | Dependable Hardware Architecture |
| 6.8 | Conclusions |
| 6.9 | References |
| 7 | Conclusions, Contributions and Future Work |
| 7.1 | Summary of the Research Work |
| 7.2 | Answers to Research Questions |
| 7.3 | Conclusions and Main Contributions of our Research Work |
| 7.3.1 | The Dependable Hardware Platform |
| 7.3.2 | The Library of Dependable IPs |
| 7.3.3 | The Dependable Workload-Sharing Duplication System |
| 7.3.4 | Process-Induced Initial-Value Dependent Workflow |
| 7.3.5 | Direct Runtime Reliability-Estimation Technique |
| 7.3.6 | Indirect Runtime Reliability-Estimation Technique |
| 7.3.7 | Differentiating Between Short-Term and Long-Term Dependability Issues |
| 7.3.8 | Behavioral Model-Based Degradation Analysis System |
| 7.3.9 | A Flexible Degradation-Analysis System for SAR ADCs |
| 7.4 | Possible Limitations of this Research Work |
| 7.5 | Future Work and Recommendations |
| 7.6 | References |
| | Abbreviations |
| | List of Publications |
| | Biography |

# CHAPTER 1 **INTRODUCTION**

ABSTRACT: This chapter presents an introduction to the research presented in this thesis. On the one hand, technology scaling in CMOS has improved performance, power consumption and fabrication costs. On the other hand, it has introduced complex dependability issues in electronic system design. These electronic systems, being an indispensable part of our daily life, demand increasing dependability, especially in safety-critical applications. Traditional methods to cope with these dependability issues are no longer suitable for electronic systems designed in these advanced technology nodes. This requires new methodologies at device, circuit and system levels. The research in this thesis investigates these issues and provides solutions at system level in order to improve the dependability of analog and mixed-signal circuits and systems on a chip.

CMOS technology has been the dominant integrated circuit (IC) technology for nearly four decades, following the trends predicted by Moore's Law. This ongoing technology scaling has resulted in a revolution in the electronics industry, and electronic system performance has increased by multiple orders of magnitude. On the one hand, it has made it possible to integrate billions of transistors (NVIDIA: 7.1 billion in 28 nm) on a single chip [Hal12]. On the other hand, power, energy, and variability issues have increased. The reduced transistor geometries and the corresponding increase in transistor densities have made it possible to integrate complex systems on a chip. This trend also includes analog, RF and mixed-signal modules in these chips. Today, these electronic systems are an indispensable part of our life.
They are frequently utilized to support human activities in civilian, industrial, and military applications. Sometimes their presence is easy to recognize, as in automatic vending machines, digital clocks, and desktop or laptop computers. Sometimes their presence may not be easily recognizable, as in an electro-mechanical unit controlling the operation of the engine or brakes of a car. The human dependence on these systems relies on the services delivered by them. The quality of the services delivered becomes extremely important when these electronic systems are used in safety-critical applications, where any failure may result in loss of human lives, damage to the environment, or loss of money. This reliance on the ability of an electronic system to deliver the agreed services in the specified time is called the dependability of the system. System-level techniques that can be used to enhance this reliance on delivering agreed services, or system dependability, of analog and mixed-signal systems-on-chip are the main subject of this thesis.

The remainder of the chapter is organized as follows. The dependability of electronic systems and its importance in our daily life is presented in section 1.1. The advances in recent technology scaling have introduced many problems that can degrade the dependability of these electronic systems, which are discussed in section 1.2. Section 1.3 discusses the dependability issues in analog and mixed-signal systems as a result of technology scaling. Furthermore, the possible pitfalls in traditional dependability improvement methodologies are briefly discussed in section 1.4. The research problem tackled in this thesis and the presented approach are summarized in sections 1.5 and 1.6 respectively. Finally, the overall thesis organization and some important references are presented in sections 1.7 and 1.8 respectively.

#### 1.1 DEPENDABILITY AND ITS IMPORTANCE

Dependability of electronic systems has become essential in our modern-day society. It represents the degree of user confidence that the system will operate as expected and that the system will not fail in normal use [Avi01]. In safety-critical systems it is the most important property. If these systems fail to deliver their services, serious problems and significant losses may result. Usually, systems whose normal behavior a user cannot trust will simply be rejected. This may lead to further rejection of other products from the same company, in the belief that these products are perhaps untrustworthy as well. In some cases a dependability failure of a sub-system may result in a complete system failure and hence in enormous cost damage. For example, a failure in the control system of a reactor or in an aircraft navigation system may result in damage that is orders of magnitude larger than the cost of the control system itself. If there is a mechanical problem in an aircraft, then as a requirement of dependability the whole aircraft should not depend on that one component (single point of failure). Therefore, these critical systems usually have backup or redundancy systems built into their designs. For example, aircraft normally have more than one engine, so that a backup is available if one engine fails. Therefore, maintaining or achieving high system dependability, especially in safety-critical systems, is the most important property. The next section will briefly discuss how technology scaling has introduced dependability problems in integrated electronic systems.

#### 1.2 DEPENDABILITY ISSUES IN ADVANCED TECHNOLOGY NODES

Traditionally, technology scaling has improved electronic circuits in terms of performance, energy consumption and die cost.
However, new technologies are far less mature as compared to older technologies because they require new materials and process steps that have not yet been thoroughly characterized [Mae05]. Moreover, temporal degradations as a result of smaller feature size and interfaces (wires) are increasing due to increasing electrical fields and temperatures [Gro05]. The electrical characteristics of the transistors and the wires will vary statistically in a spatial and a temporal manner, directly translating into design uncertainty during fabrication and even during operational life. This combined impact of manufacturing uncertainty (process variability) and temporal degradation results in time-dependent variability. Unfortunately, traditional worst-case design slacks or margins are not sufficient anymore to capture the temporal degradation in circuits and hence systems [Gro05]. As a result, time-dependent uncertainties become a great threat to the design of complex systems-on-chip (SoC) implementations.

Characterizing a number of new materials, for example high-k dielectrics, and their interaction with degradation mechanisms is extremely difficult [Miy02, Rib05]. Furthermore, supply voltage scaling has been saturating in order to keep sufficient headroom between the transistor threshold voltage and the supply voltage, hence increasing the electrical fields and stress conditions for these scaled devices. In addition, effects like:

- soft-breakdown (SBD) in gate oxide of transistors (especially dramatic in high-k oxides) [Gro05],
- Negative Bias Temperature Instability (NBTI) issues in the threshold voltage of the PMOS transistors [Red02],
- Hot Carrier Injection (HCI) issues in the drain current of MOSFETs [Bra09],
- Electro-Migration (EM) problems in copper interconnects [Bru05],
- breakdown of dielectrics in porous low-k materials [Tok05],

are now becoming clear threats for the functional operation of the circuits and systems in near future technologies.
The net result is that it becomes increasingly difficult to guarantee the lifetime and hence the dependability of electronic systems in new technology nodes.

#### 1.3 DEPENDABILITY ISSUES IN ANALOG AND MIXED-SIGNAL SYSTEMS

With the introduction of nano-scale CMOS technologies, analog and mixed-signal designers are faced with many new challenges at different phases of design. These challenges include severe degradation in device matching characteristics as a result of device and lithographic quantum limits [Lew09]. Non-idealities in scaled technologies also have a significant effect on analog and mixed-signal systems, including effects on gain, linearity and noise figure. Analog and mixed-signal systems are an important part of most critical systems, especially in automotive, medical and military applications, yet they have received little attention with regard to dependability. With designs moving towards smaller dimensions, electric fields in the channels are becoming larger, causing more energetic electrons to damage the channel oxide and hence degrading the circuit performance. The introduction of new surface-channel pMOSFETs for analog circuits has, on the one hand, made it possible to fabricate both digital and analog circuits on the same chip; on the other hand, it has increased the effects due to NBTI and HCI [Jha05] and hence the degradation in analog and mixed-signal circuit performance. Furthermore, in analog circuits, dc biasing voltages always exist irrespective of the input signal. In addition, because of the high density of digital circuitry present near the analog circuitry on the same chip, a high temperature may also exist in addition to the challenges of the applied dc gate and drain voltages. This results in a continuous stress (voltage and temperature) in analog circuits.
Since many analog circuit operations require matched parameters, any mismatches introduced by these continuous stresses will cause performance degradations or circuit failures [Jha05] and can impose a fundamental limit on the dependability of analog and mixed-signal circuits. The next section will briefly discuss the difficulties associated with enhancing the dependability of these analog and mixed-signal systems using *traditional* techniques. This will lead to the formulation of the problem statement for this research thesis.

#### 1.4 TRADITIONAL DEPENDABILITY IMPROVEMENT PITFALLS

In new technology nodes, the traditional worst-case analysis and system design paradigms are breaking down because of the increasing dynamism present in modern applications. The way degradation problems appear within these electronic systems is a largely random process that depends on the actual operating conditions, such as time, temperature and stress voltages [Sta01]. This is especially true for large circuits and systems featuring many transistors, which can undergo significantly different stress conditions while executing dynamic applications. This indicates that innovation in electronic design and analysis has to take place to counteract the impact that temporal parametric degradations will have on the actual useful lifetime of electronic systems.

Traditional worst-case analysis, where designers tune their electronic designs to meet the performance constraints for all the corner points, is still widely used in industry. However, it suffers from a number of disadvantages. Selected corner points are usually very pessimistic, because it is extremely unlikely that all the parameters will have their maximum or minimum values simultaneously. Therefore, the design margins required to make the analog and mixed-signal systems operational under all corner conditions are excessive. Furthermore, the number of parameters affected by time-dependent variability becomes very large (e.g.
ADC static and dynamic parameters as explained in Chapter 6). This means that analog and mixed-signal system designers will have to deal with parameter spaces of many dimensions and an extremely large number of corner points. Finally, worst-case analysis techniques cannot handle the impact of intra-die time-dependent variability, which is spatially uncorrelated in nature [Naj05]. This is because the electrical parameters of each transistor would become an additional axis in the parameter space and the complexity would become unmanageable. This means the worst-case margin added by system designers on top of the worst-case circuit tuning already performed by circuit designers will result in increasingly larger safety margins. In short, with design-time tuning of the electronic systems it will be very difficult to meet the performance constraints during the operational life. This leads to exploring and investigating new solutions which are the subject of this research thesis as formulated in the next section. Being the interface between the real world and the digital world, analog and mixedsignal front/back ends are an essential part of most safety-critical systems. The goal of this research is to explore and investigate new techniques that can potentially be used to enhance the dependability of these analog and mixed-signal front ends despite the unsuitable traditional worst-case design techniques. As discussed above, on one side the technology scaling has improved electronic circuits in their performance, low energy consumption and lower die cost. While on the other hand, it has introduced new special, temporal and dynamic variations. The focus of our research will be on temporal variations that can result in temporal degradation of electronic system performance and hence a potential cause of dependability degradation. 
Among the different parts of an analog and mixed-signal front end, only the amplifiers and analog-to-digital converters (ADCs), being relatively generic to every analog and mixed-signal front end, are considered for dependability investigations. The sensor/actuator part, being relatively different for every application and quite often related to micro-electro-mechanical systems (MEMS), has not been considered for dependability improvements. The exclusion of the sensor/actuator part has no influence on the proposed dependability enhancement techniques presented in this thesis. Different levels of investigation and improvement can be considered for the amplifier and ADC part, namely device level, circuit level, and system level. However, our research considers only system-level techniques.

The amplifier and ADC, operating in the continuous-time and discrete-time (digital) domains, come with a range of performance parameters. As will be shown later in this thesis, potentially most of these performance parameters of the amplifier and ADC will be affected by temporal variations, which makes the whole problem complex and unmanageable. Therefore, our goal is to identify and investigate only the critical performance parameters, which are usually application dependent.

As discussed above, solutions to static variations already exist and are usually applied at the design stage. However, the combined impact of manufacturing uncertainty (process variability) and temporal degradation (aging) results in time-dependent variability, which requires a new approach to system design. In reality, as a result of time-dependent variations, the performance parameters of each system component will follow a statistical distribution. Some components will have performance parameter values above the mean and some below it. This variation is usually not exploited in traditional techniques dealing with variations at the system level.
New techniques are required in the design of analog and mixed-signal systems-on-chip at system level to overcome these limitations. With design-time tuning of the electronic systems alone, it will be difficult to meet the performance constraints during the operational life. Therefore, the goal of our research is to exploit system-level techniques that can potentially be used during the operational life of electronic systems to enhance their dependability. The research goal of this thesis can be broken down into the following questions:

1) What type of hardware architecture can be used to address the technology-scaling related temporal-degradation (aging) issues in analog and mixed-signal (AMS) systems during their operational life?
2) How can optimization be achieved among different dependability requirements and other issues like area, power, speed etc. in AMS systems?
3) What type of improvements will be necessary in the hardware architecture to address the initial-value dependent degradation issues in AMS systems?
4) What type of efficient methodologies can be used in order to indirectly estimate the performance of AMS systems during their operational life?
5) What (additional) actions will be required to distinguish between time-dependent variations and dynamic variations (i.e. long-term and short-term variations as explained in Chapter 5) in order to enhance the dependability of AMS systems?
6) What type of alternative methods, as compared to conventional device-level simulations, can be used to analyze/investigate time-dependent variations/degradations in complex analog and mixed-signal systems (e.g. analog-to-digital converters)?

The corresponding answers to these research questions are analyzed and concluded in Chapter 7. The next section summarizes the presented approach to address the above research questions.
#### 1.6 PRESENTED APPROACH The approach presented in this thesis is mainly theoretical and is based upon mathematical models or workflows developed during this research. However, a number of transistor-level simulations are carried out in order to extract the degradation information that is further used at system-level to analyze and investigate presented dependability improvement strategies. Initially, a hardware platform is proposed to address temporal-degradation issues in AMS systems in order to enhance the dependability. This hardware platform is further modified to address area overhead and other complexity issues. The dependability improvement of both of the proposed hardware platforms is then analyzed mathematically and compared against traditional approaches. In the next step, process-induced initial-value dependent degradation information is utilized to propose a workflow for better management of system dependability during operational life. Direct and indirect means of extracting system performance (later used to estimate the system reliability) during operational life are also included in this workflow. The proposed approach is mathematically developed and then simulated for a target system under a number of scenarios. Short-term and long-term environmental changes may lead to corresponding short-term (dynamic) and long-term (time-dependent) dependability issues. Therefore, in order to separate dynamic (short-term) variations from time-dependent (long-term) variations another workflow is presented, analyzed and later simulated for a target system under a number of environmental conditions. To simplify degradation analysis for complex analog and mixed-signal systems a behavioral model based approach is also presented and simulated for a charge-redistribution successive approximation register (SAR) ADC. 
In short, a number of methods for better analysis of the dependability issues in AMS systems, together with corresponding dependability improvement strategies under time-dependent, initial-value dependent, and environment-dependent degradation, are presented. Furthermore, an optimization technique to overcome area overhead and the corresponding compromises in power, speed, and dependability requirements is also presented. The next section briefly discusses the overall organization of this thesis.

#### 1.7 THESIS ORGANIZATION

The thesis is organized as follows. The necessary background information required for this research work, including the state of the art, its shortcomings and potential considerations for the current research work, is briefly described in Chapter 2. The system-level dependability issues of a general-purpose analog and mixed-signal front end are analyzed in Chapter 3, where the corresponding dependability improvement hardware architecture is also proposed. In addition, further improvements to this architecture to overcome area overhead are presented in that chapter. An optimization technique, based on a library of dependable IPs (intellectual property) to select the best possible system modules (IPs) under the required dependability levels and area, power and speed constraints, is also the subject of Chapter 3. Issues related to initial-value dependent degradations, their consequences for system performance and further improvements to the proposed hardware platform to address these issues are presented in Chapter 4. The complexity and loading issues that arise when monitoring system performance under temporal degradation are also addressed in Chapter 4. This is accomplished by providing a novel technique for indirectly estimating the system performance using a set of degradation values over input stress conditions acquired via design-time simulations.
The influence of dynamic (short-term) and temporal (long-term) variations, the method to differentiate between these two types of variations and the potential benefits in improving the overall dependability of analog and mixed-signal systems are presented and discussed in Chapter 5. In order to resolve degradation analysis issues in complex analog and mixed-signal systems, a system-level degradation analysis method for a charge-redistribution SAR ADC is presented in Chapter 6. It is based on incorporating circuit-level degradation information in system-level behavioral models. In Chapter 7, the overall contribution of the research presented in this thesis is summarized, and possible limitations of the proposed methodology and potential future work on improving the dependability of analog and mixed-signal front ends are discussed.

#### 1.8 REFERENCES

[Avi01] A. Avizienis, J-C. Laprie, and B. Randell, "Fundamental concepts of dependability," in Laboratory for Analysis and Architecture of Systems (LAAS-CNRS) Technical Report no. 01-145, Apr. 2001.

[Bra09] A. Bravaix, et al., "Hot-Carrier acceleration factors for low power management in DC-AC stressed 40nm NMOS node at high temperature," in IEEE Int. Reliability Physics Symposium, pp. 531-548, 2009.

[Bru05] C. Bruynseraede, et al., "The impact of scaling on interconnect reliability," in IEEE Int. Reliability Physics Symposium, pp. 7-17, 2005.

[Gro05] G. Groeseneken, R. Degraeve, B. Kaczer, and P. Roussel, "Recent trends in reliability assessment of advanced CMOS technologies," in IEEE Int. Conf. Microelectronic Test Structures (ICMTS), pp. 81-88, 2005.

[Hal12] G. Halfacree, "Nvidia announces world's most complex GPU," in bit-tech news, 2012. http://www.bit-tech.net/news/hardware/2012/05/18/nvidia-gk110/1

[Jha05] N.K. Jha, P.S. Reddy, D.K. Sharma, and V.R. Rao, "NBTI degradation and its impact for analog circuit reliability," in IEEE Trans. Electron Devices, Vol. 52, No. 12, pp. 2609-2615, 2005.

[Lew09] L.L. Lewyn, T. Ytterdal, C. Wulff, and K. Martin, "Analog Circuit Design in Nanoscale CMOS Technologies," in IEEE Proceedings, Vol. 97, No. 10, pp. 1687-1714, 2009.

[Mae05] K. Maex, et al., "Technology aware design and design aware technology," in IEEE Int. Conf. Integrated Circuit Design and Technology (ICICDT), pp. 77-81, 2005.

[Miy02] S. Miyazaki, "Characterization of high-k gate dielectric/silicon interfaces," in Applied Surface Science, Vol. 190, Issues 1–4, pp. 66-74, 2002.

[Naj05] F.N. Najm, "On the need for statistical timing analysis," in IEEE Design Automation Conference (DAC), pp. 764-765, 2005.

[Red02] V. Reddy, et al., "Impact of negative bias temperature instability on digital circuit reliability," in IEEE Reliability Physics Symposium, pp. 248-254, 2002.

[Rib05] G. Ribes, et al., "Review on High-k Dielectrics Reliability Issues," in IEEE Trans. Device and Materials Reliability, Vol. 5, No. 1, pp. 5-19, 2005.

[Sta01] J.H. Stathis, "Physical and predictive models of ultrathin oxide reliability in CMOS devices and circuits," in IEEE Trans. Device and Materials Reliability, Vol. 1, No. 1, pp. 43-59, 2001.

[Tok05] Z. Tokei, Y. Li, and G. Beyer, "Reliability challenges for copper low-k dielectrics and copper diffusion barriers," in Journal of Microelectronics Reliability, pp. 1436–1442, 2005.

### BACKGROUND AND RELATED WORK

ABSTRACT - This chapter presents the background knowledge essential to understand the achievements in this research work. It starts with the concept and definition of dependability. Next, its most important attributes (reliability, maintainability, and availability) as well as the sources of impairments and the means to improve dependability are presented. The mathematical theory necessary to understand these dependability attributes is also explained. Physical mechanisms responsible for causing failures and degradations are also briefly discussed.
At the end, a brief overview of previous research, their limitations and shortcomings as well as a summary of important issues addressed in this thesis in order to enhance the dependability of analog and mixed-signal systems at system level are described. #### 2.1 Introduction The rapid advances in microelectronic, computer and networking systems have resulted in their penetration into almost every aspect of our life. They are utilized to support the human activities and as a result these activities depend more and more on the services delivered by these systems. As the technology is scaling rapidly, the complexity of these systems is also getting higher and higher. Therefore, the chance of a fault that impedes the delivery of the service has become greater than ever. As a result, failure of these systems could result in a loss of time, money, damage to environment, or even human life in some critical applications. Therefore, the assessment and improvement of system dependability becomes a key step in the design, analysis, and tuning of such systems. The rest of the chapter is organized as follows. The definition of dependability, its attributes and the possible dependability impairments and means are presented in section 2.2. Among different attributes of dependability a number of attributes have been selected for the research work presented in this thesis. These attributes and their theory are being discussed in sections 2.3 and 2.4 respectively. Section 2.5 briefly describes different failure mechanisms that can be responsible for dependability degradation. A brief overview of best-practice methods, their limitations along with important considerations being addressed in this current research work in order to enhance the dependability of analog/mixed-signal systems are discussed in section 2.6. The conclusions and some important references are presented in sections 2.7 and 2.8 respectively. 
#### 2.2 DEPENDABILITY The term dependability is reported to have been first used as a technical term in 1960 by Hosford [Pra95]. Its early usage was confined to the notions of availability and reliability. In the meantime, J.C. Laprie expanded the term dependability as a wider concept in 1985 [Lap85] because the meaning of "reliability" that the fault-tolerance technology treated had broadened. From then on, dependability has been used in various fields to this day. Unfortunately, the term dependability has been assigned many different meanings in the literature. For example, in the case of computer-based systems, dependability has been defined as [Par88]: "the justifiable confidence the manufacturer has that it will perform specified actions or deliver specified results in a trustworthy and timely manner" Avizienis et al. defined the dependability of a system as [Avi04]: "the ability to deliver service that can be justifiably trusted" Where the service delivered by a system is its behavior as it is perceived by its user(s); a user is another system (human or physical) which interacts with the former. However, according to the classical definition of dependability it represents the property of the system that integrates attributes, like availability, reliability, maintainability, safety, integrity and confidentiality [Avi01, Avi04, Buj06]. Usually, dependability is classified into three fundamental characteristics: attributes, impairments and means. The dependability *attributes* are the system properties that are expected from a system. The dependability *impairments* represent the potential threats to dependability and the dependability *means* are the methods or techniques that can be used to build a dependable system. These characteristics are further discussed in the next sections. #### 2.2.1 DEPENDABILITY ATTRIBUTES The attributes of dependability are the system properties that can be expected as a requirement of its service to be delivered. 
There are a number of system properties that can be considered dependability attributes, such as reliability, availability, maintainability, safety, integrity and confidentiality [Avi01]. Depending on the application, one or more of these attributes may be needed to appropriately evaluate system behavior. For example, for a nuclear power plant system, reliability and safety will be the most important attributes. The different attributes of dependability can be defined as [Avi01]:

**Availability:** This indicates the readiness of a system for its correct service. It is defined as the probability that a system will be available to correctly deliver its service at any given time.

**Reliability:** This shows the capability of a system to provide the continuity of its correct service. It is defined as the probability that a system will be

**Maintainability:** This indicates the capability of a system to undergo modifications and repairs. It is defined as the probability that a system will be repaired at any given time if it fails to deliver correct functionality.

**Safety:** This is defined as the capability of the system to avoid catastrophic consequences on the users or the environment.

**Integrity:** This is the capability of a system to avoid any alterations. It can be further classified into two types:

- **1. System integrity** which defines the ability of a system to detect faults in its own operations and to inform a human operator.
- **2. Data integrity** which defines the ability of a system to prevent damage to its own database and to detect, and possibly correct, errors that occur as a consequence of faults.

**Confidentiality:** This is the capability of a system to prevent the unauthorized disclosure of information.

Among the above attributes of dependability, only the first three, namely availability, reliability and maintainability, are considered in the research work presented in this thesis, as discussed in section 2.3.
#### 2.2.2 DEPENDABILITY IMPAIRMENTS

The impairments of dependability are the reasons that could prevent a system from functioning correctly [Avi01]. Usually, they are described in terms of faults and failures. A system failure is an event that occurs when the delivered service deviates from the correct service. A failure is thus a transition from correct service to incorrect service, i.e. not implementing the system function. Faults are the basic cause that can lead to failures in the system. Faults can occur as a result of numerous problems at the specification, implementation or fabrication level. They can also result from external factors, including environmental disturbances or human actions, either accidental or deliberate.

#### 2.2.3 DEPENDABILITY MEANS

The means of dependability are the methods and techniques that facilitate the development of dependable systems [Avi01]. Usually, they are classified into four types.

*Fault prevention* is the method that is used to prevent the occurrence or introduction of faults. Examples are shielding and radiation hardening to prevent radiation-induced faults, and training and maintenance to prevent user-induced faults.

*Fault tolerance* is the method that is used to develop a system so that it functions correctly in the presence of faults. Usually, fault tolerance is achieved by using some sort of redundancy.

*Fault removal* is the method that is used to reduce the number of faults that are present in the system. It is normally achieved by verification, diagnosis, and correction procedures performed during the system development phase or by corrective and preventive maintenance procedures performed during the system operational life.

*Fault forecasting* is the method that is used to estimate current faults, possible future fault occurrences and the consequences of faults.
It is normally performed by evaluating the system behavior with respect to fault occurrences either qualitatively, quantitatively or probabilistically.

#### 2.3 SELECTED DEPENDABILITY ATTRIBUTES

As described in Chapter 1, the aim of this research is to explore and investigate new techniques that can be used to enhance the dependability of analog and mixed-signal front ends. However, as described above, the dependability of a system is, depending on its application, a collection of system properties such as availability, reliability, maintainability, safety, integrity and confidentiality that are expected from the system. Therefore, one or more of these properties will be required to appropriately evaluate the dependability of analog and mixed-signal front ends. Being an important part of most critical systems, analog and mixed-signal systems should always be available for functionally correct service and should be maintained/repaired as quickly as possible. Therefore, among the different attributes of dependability, reliability, maintainability, and availability are the focus of the research presented in this thesis.

#### 2.4 DEPENDABILITY THEORY

In this section, the necessary mathematical theory behind reliability, availability and maintainability is discussed. These theories are stated in terms of probability and statistics because of the inherent degree of uncertainty in predicting a failure and the associated repair actions. Hence the uncertainties of failures or repairs are expressed as the percentage or probability that a given part will fail or can be repaired within a specified time.

#### 2.4.1 RELIABILITY THEORY

As mentioned in section 2.2.1, the reliability can be defined in terms of probability as "the probability that a system will be available to correctly deliver its service at any given time", mostly under predefined conditions.
The general expression for the reliability function is given by [Kan11]:

$$R(t) = e^{-\lambda t} \tag{2.1}$$

where $\lambda$ represents the constant failure rate, which is defined as the number of failures per unit time. Equation 2.1 is frequently used in reliability analysis, particularly for electronic systems. This is also known as the *exponential failure law* [Kan11].

Several other commonly used reliability concepts involve the mean-time-between-failures (*MTBF*) and the mean-time-to-failure (*MTTF*). Usually *MTBF* is the predicted elapsed time between inherent failures of a system during operation [Eus08, Wik14a]. This is the expected value of the arithmetic mean (average) time between failures of a system. This concept is typically defined for repairable systems (finite repair time), where a failed system is repaired as part of a renewal process. In contrast, *MTTF* is typically defined for non-repairable systems (infinite repair time) and represents the expected value of the average time to failure of a system [Eus08, Wik14a].

Figure 2.1: Time-between-failures (TBF) and time-to-repair (TTR) for a repairable system.

Figure 2.1 shows the "Up" and "Down" states of a repairable system. The time spent in the "Up" state between two consecutive "Down" states is defined as the time-between-failures (*TBF*). Therefore, for repairable systems the *MTBF* can be defined mathematically as:

$$MTBF = \frac{\sum_{i=1}^{n} TBF_i}{n} \tag{2.2}$$

where $\sum_{i=1}^{n} TBF_i$ is the total operating (up) time and $n$ is the total number of failures.

This means that 1/MTBF represents the number of failures per unit time or, as previously defined, the failure rate ($\lambda$). Therefore, equation 2.1 in the case of a constant failure rate can be rewritten as:

$$R(t) = e^{-t/MTBF} \tag{2.3}$$

This is another well-known equation used in reliability engineering.
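As a minimal numerical sketch of equations 2.1 to 2.3, the following Python fragment computes the MTBF from a list of observed up-times and evaluates the exponential reliability function. The function names and the sample TBF values are hypothetical and serve only as an illustration; they are not taken from the analyses in this thesis.

```python
import math


def mtbf(tbf_values):
    """Equation 2.2: arithmetic mean of the observed time-between-failures."""
    if not tbf_values:
        raise ValueError("at least one TBF observation is required")
    return sum(tbf_values) / len(tbf_values)


def reliability(t, mtbf_value):
    """Equation 2.3: R(t) = exp(-t / MTBF), assuming a constant failure rate."""
    return math.exp(-t / mtbf_value)


# Hypothetical up-times in hours (illustrative values, not measured data).
tbf_hours = [900.0, 1100.0, 1000.0]

mean_tbf = mtbf(tbf_hours)                   # mean up-time in hours
failure_rate = 1.0 / mean_tbf                # lambda of equation 2.1
r_at_mtbf = reliability(mean_tbf, mean_tbf)  # reliability at t = MTBF
```

Under the constant-failure-rate assumption, the reliability at t = MTBF equals e^{-1}, i.e. roughly 37%, so the MTBF should not be read as a guaranteed failure-free period.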
Equation 2.3 gives the reliability of the system in terms of probability, with '1' being the highest probability (highly reliable) and '0' being the lowest probability (highly unreliable). An important conclusion that can be drawn from this equation is that, at any time 't', the reliability R(t) of the system is directly related to the MTBF value: the larger the MTBF, the higher the reliability R(t) of the system, and vice versa.

Furthermore, if 'iTTF(t)' represents the instantaneous-time-to-failure, defined as the remaining time before the failure occurs at any time 't' as shown in Figure 2.2, then the reliability of a repairable system at that time can be approximated by:

$$R(t) \cong iTTF(t) \tag{2.4}$$

Figure 2.2: Instantaneous-time-to-failure (iTTF(t)) for a repairable system.

This equation gives the *quantitative* value of the reliability for a repairable system, with MTBF and 'zero' being the highest and lowest reliability values respectively. In other words, the larger the time remaining before the system failure occurs, the higher the reliability of the repairable system, and vice versa. Equation 2.4 is further used in Chapter 4 to calculate the reliability of a repairable system during its operational life.

#### 2.4.2 MAINTAINABILITY THEORY

The main purpose of maintainability is to design a system such that it can be repaired if a failure occurs. As mentioned in section 2.2.1, the maintainability can be defined in terms of probability as "the probability that a system will be repaired at any given time if it fails to deliver the correct functionality". Mathematically it is usually expressed as [Rel14b]:

$$M(t) = 1 - e^{-\mu t} \tag{2.5}$$

where $\mu$ represents the constant repair rate, which is defined as the number of repairs per unit time. The concept of maintainability is typically defined for repairable systems and is usually related to the mean-time-to-repair (MTTR).
Here, MTTR is usually defined as the expected value of the mean (average) time required to repair a failure in a repairable system [Eus08, Wik14b]. Figure 2.1 shows the time-to-repair ($TTR_i$), defined as the time required to repair a system when the $i^{th}$ failure occurs, for a repairable system. Therefore, the MTTR can be defined mathematically as:

$$MTTR = \frac{\sum_{i=1}^{m} TTR_i}{m} \tag{2.6}$$

where $\sum_{i=1}^{m} TTR_i$ is the total repair (down) time and $m$ is the total number of repairs.

This means that 1/MTTR represents the number of repairs per unit time or, as previously defined, the repair rate ($\mu$). Therefore, equation 2.5 in the case of a constant repair rate can be rewritten as:

$$M(t) = 1 - e^{-t/MTTR} \tag{2.7}$$

This is the general expression for the maintainability function. This equation gives the maintainability of a repairable system in terms of probability, with '1' being the highest probability (highly maintainable) and '0' being the lowest probability (highly unmaintainable).

#### 2.4.3 AVAILABILITY THEORY

Similar to maintainability, the availability concept is usually used for repairable systems that are required to operate continuously, i.e. round the clock. A system, at any random point in time, can be either operating ("up") or "down" because of a failure, as shown in Figure 2.1. Therefore, in this original concept a repairable system is considered to be in only two possible states: operating or in repair. In this way, the availability is defined as the probability that a system is operating satisfactorily at any random point in time 't', when subject to a sequence of "up" and "down" cycles (Figure 2.1).
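Before turning to the availability expression, the maintainability relations of section 2.4.2 (equations 2.6 and 2.7) can be illustrated with a small numerical sketch. The Python fragment below is only an illustration under the constant-repair-rate assumption; the function names and the sample repair times are hypothetical and do not come from the thesis.

```python
import math


def mttr(ttr_values):
    """Equation 2.6: arithmetic mean of the observed time-to-repair values."""
    if not ttr_values:
        raise ValueError("at least one TTR observation is required")
    return sum(ttr_values) / len(ttr_values)


def maintainability(t, mttr_value):
    """Equation 2.7: M(t) = 1 - exp(-t / MTTR), assuming a constant repair rate."""
    return 1.0 - math.exp(-t / mttr_value)


# Hypothetical repair times in hours (illustrative values, not measured data).
ttr_hours = [2.0, 4.0, 6.0]

mean_ttr = mttr(ttr_hours)                       # mean repair time in hours
repair_rate = 1.0 / mean_ttr                     # mu of equation 2.5
m_at_mttr = maintainability(mean_ttr, mean_ttr)  # maintainability at t = MTTR
```

With these assumptions, the probability that a repair has been completed within one MTTR is 1 - e^{-1}, i.e. roughly 63%, and it approaches 1 as the allowed repair time grows.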
Mathematically, availability can be expressed as:

$$A = 1 - \frac{\text{Down Time}}{\text{Total Time}} = \frac{\text{Up Time}}{\text{Total Time}} = \frac{\text{Up Time}}{\text{Up Time} + \text{Down Time}} \tag{2.8}$$

Using equations 2.2 and 2.6, equation 2.8 can be rewritten as:

$$A = \frac{n \times MTBF}{n \times MTBF + m \times MTTR} \tag{2.9}$$

If the total number of failures 'n' and the total number of repairs 'm' are equal, the above equation reduces to:

$$A = \frac{MTBF}{MTBF + MTTR} \tag{2.10}$$

This means that availability is a combination of the reliability (MTBF) and maintainability (MTTR) parameters. This equation is the general expression for the availability function, which is frequently used in the literature. Furthermore, by incorporating the concept of instantaneous-time-to-failure (*iTTF*), as defined above, the availability of a repairable system at any time 't' during its operational life can be approximated by:

$$A(t) \cong \frac{iTTF(t)}{iTTF(t) + MTTR} \tag{2.11}$$

This equation is further used in Chapter 4 to calculate the availability of a repairable system during its operational life.

#### 2.5 DEGRADATION MECHANISMS AND SYSTEM DEPENDABILITY

As a result of the continuous aggressive scaling of technology in terms of device dimensions, the increasing electric fields and the usage of new materials to meet the demands set by these technologies, the reliance on electronic systems fabricated in these technology nodes has become a very important aspect. There are various degradation mechanisms that can degrade the performance of devices, circuits and their associated electronic systems as a result of this aggressive technology scaling. In this section some of the important degradation mechanisms, namely bias temperature instability (BTI), hot carrier injection (HCI), time-dependent dielectric breakdown (TDDB), and electro-migration (EM), are briefly introduced.
#### 2.5.1 BIAS TEMPERATURE INSTABILITY

Bias temperature instability (BTI) is a degradation mechanism that occurs in MOS devices as a result of interface traps between the gate oxide and the silicon substrate at elevated temperatures ($30^{\circ}C$ to $200^{\circ}C$) [Str09] and hence degrades the dependability of electronic devices. This degradation mechanism results in a shift of the device threshold voltage ($V_{th}$) and a loss of drive current ($I_{on}$). The BTI effect is more severe for pMOSFETs than for nMOSFETs due to the presence of holes in the PMOS inversion layer, which are known to interact with the oxide states. The highest impact of BTI in pMOSFETs is observed when they are stressed with a high negative gate voltage at elevated temperatures [Ent07]. It is referred to as negative BTI (NBTI) due to the negative gate-to-source voltage. In pMOSFETs, the channel holes interact with passivated hydrogen bonds in the dielectric, resulting in the generation of traps and interface states. This results in an increase in the threshold voltage ($V_{th}$), and the effect increases at high temperatures. The introduction of new dielectric materials such as high-k has increased the BTI effect in nMOSFETs, where it is referred to as positive bias temperature instability (PBTI) due to the positive gate-to-source voltage.

It has been noticed that BTI degradation starts decreasing very quickly after the removal of the stress. This recovery process is caused by de-trapping of charge when the stress signal is removed after a stress phase [Gra07]. The stress signal causing BTI degradation can be of two types: static stress (DC stress) and dynamic stress (AC stress). AC stress is known to be beneficial for lifetime enhancement because it can introduce the recovery process mentioned above [Nig06, Che03, Aba03].
Recovery after NBTI or PBTI stress in MOSFETs and its dependence on gate voltage, temperature and frequency of the stress signal has been a hot topic of research in the past decade [Rei10]. Currently, BTI is one of the most serious and important reliability concerns for both digital and analog/mixed-signal CMOS circuits. At advanced technology nodes this effect is enhanced due to reduced voltage headroom, high oxide electric fields resulting from non-constant field scaling, high temperatures due to higher power dissipation and the introduction of new dielectric materials.

#### 2.5.2 HOT CARRIER INJECTION

Hot carrier injection (HCI) degradation has been an important failure mechanism for the last three decades and still remains important in new technologies. HCI occurs when an "electron" or "hole" gains sufficient kinetic energy to overcome a potential barrier and break an interface state in MOS devices. These charge carriers can become trapped in the gate dielectric and hence permanently change the transistor characteristics [Mar13, Wik14c]. Therefore, HCI will degrade the electrical characteristics of MOSFETs and hence the dependability of associated electronic systems.

#### 2.5.3 TIME DEPENDENT DIELECTRIC BREAKDOWN

Time-dependent dielectric breakdown (TDDB) is a degradation phenomenon that occurs in the thin insulating layer $SiO_2$ between the control "gate" and the conducting "channel" of the transistor. The general belief is that TDDB of gate insulating material results from the cumulative effect of insulator trapped charge buildup during short-term and long-term high-field stress. High trapped-charge-induced local fields build up within the insulator, which creates defects in the oxide film. These defects accumulate with time and eventually reach a critical density, triggering a sudden loss of dielectric properties [Sta01]. These defects can also cause gate leakage and excess noise in MOSFETs.
A surge of current produces a large localized rise in temperature, leading to permanent structural damage in the $SiO_2$ . This will create failures in MOSFETs and hence the dependability of associated electronic systems will degrade.

#### 2.5.4 ELECTRO-MIGRATION

Electro-migration (EM) is the dominant failure mode of interconnects that results from aggressive interconnect scaling. As technology scales, the device density increases and as a result the interconnects that carry signals are reduced in size, specifically in height and cross section. This leads to extremely high current densities, on the order of at least $10^6$ A/cm$^2$, and associated thermal effects, which can cause reliability and hence dependability problems [Sco91].

#### 2.6 ANALOG AND MIXED SIGNAL DEPENDABILITY IMPROVEMENTS

With the introduction of nano-scale CMOS technologies, different degradation mechanisms, as described above, can have a big impact on the lifetime of electronic systems; this is especially true in safety-critical systems running under harsh environments for a long time. Usually these degradation mechanisms have negligible effects on consumer devices (like mobile phones) running under normal environmental conditions. Therefore, in these technology nodes, the designers of analog and mixed-signal systems running under harsh environmental conditions are faced with many new challenges at different phases of the design. This section sheds light on some of the previous work that has been done in order to improve the dependability (mostly only reliability) of analog and mixed-signal systems. Section 2.6.1 gives a brief history of the different degradation mechanisms. Different efforts made at device level, design level and simulation-tool level, along with some examples from recent practices to analyze and mitigate these degradation effects, are given in section 2.6.2.
The corresponding shortcomings and limitations of these practices, along with some of the important issues addressed in this research work, are presented in section 2.6.3.

#### 2.6.1 Brief History

Different degradation mechanisms like NBTI, HCI, TDDB, and EM were first discovered by device scientists in the seventies and eighties. At that time most of the efforts were spent in order to understand these degradation mechanisms. Later on, in the nineties, scientists started to investigate the circuit behavior under these degradation mechanisms. They started measuring the behavior of individual transistors under high temperatures and elevated stress voltages and used this information to determine the design margins of their circuits. At that time most of the efforts were made on the technology-process side to limit the degradation effects at device level; very little effort was spent to mitigate their effect at the circuit and system level.

#### 2.6.2 RECENT PRACTICES

In recent times, new tools have been developed to study the effect of the different degradation mechanisms mentioned above on individual devices and associated electronic circuits [Hon09]. Most of the efforts have been made in design-level simulations that analyze the effect of different degradation mechanisms on the behavior of circuits. Unfortunately, mostly only reliability, being one important attribute of dependability, has been analyzed in these simulations. Therefore, these simulations are usually referred to as *reliability simulations*. In these simulations, degradation models of devices, based on technology information, are used to simulate the degraded or lifetime behaviours to establish circuit reliability as a function of time. This information is then further used to redesign the circuit or to incorporate circuit strategies for improving reliability. Different efforts that have been conducted at device, design and simulation-tool level are further discussed in the following sections.
#### 2.6.2.1 DEVICE-LEVEL EFFORTS

The device lifetime of nanometer CMOS technology is reducing with each new technology generation due to reliability issues. Therefore, with the increasing complexity of systems and their interactions, it is becoming very important to know how to identify reliability issues and to be able to alleviate them. Some of these reliability problems can be, and are being, solved at the device level. For example, in [Par96] a low-energy arsenic implant is applied just after the deep n-LDD (lightly doped drain) implant to reduce the hot-carrier degradation effects in nMOSFET structures. However, a technology-based solution is not always possible, especially since the focus of device engineers is typically on developing smaller and faster transistors with lower power consumption.

#### 2.6.2.2 DESIGN-LEVEL EFFORTS

Reliability problems during the design stage are usually solved by using a design for reliability (DFR) flow. This flow is based upon a set of tools that support the product and process design cycle to ensure that the customer reliability expectations are met throughout the product life. The key activities at the different stages of DFR, as proposed by Reliasoft [Rel14a], are briefly explained below.

**Define:** The key reliability requirements, goals and environmental and usage conditions (mission profile) for a product are clearly and quantitatively defined at this stage.

**Identify:** In this stage the possible reliability threats are identified. In order to quantify the risk associated with different failure effects, a failure mode effect analysis (FMEA) strategy can be used as a tool at this stage.

**Analyse and Assess:** During this stage, the product lifetime is estimated and expressed as the mean time to failure (MTTF). The product lifetime is mainly estimated based on accelerated stress tests on individual devices. Designers usually use large design margins because the circuit lifetime (MTTF) is hard to estimate.
This usually results in larger area and higher power consumption, while large uncertainties about the lifetime of the circuit remain. Therefore, accurate transistor models for all important transistor unreliability effects and efficient circuit simulation techniques to i) analyse the reliability of a circuit and ii) identify circuit reliability weak spots can be used to reduce these problems.

**Quantify, Improve and Validate:** During this stage, actual measurement results are used to verify the simulation results obtained from the previous stages. These actual measurements are typically done on prototype circuits using accelerated life tests (ALT). Usually, existing validation techniques are very slow. Therefore, to quickly identify product defects and to solve them effectively, highly accelerated life tests (HALT) and highly accelerated stress tests (HAST) are used [Rel14a]. The failure modes identified by these tests, together with other product requirements and the data collected from field returns, are used to identify the root causes, which are further used as an input for developing new products. This process is repeated until the product is considered to be acceptable under the stated specifications.

**Monitor and Control:** Once a product is in production, the process is monitored to assure that process variations are kept under control and that reliability is still guaranteed.

A brief overview of the efforts made at the simulation-tool level is given in the next section.

#### 2.6.2.3 SIMULATION-TOOL LEVEL EFFORTS

Simulation tools, which are an essential part of the previous "Analyse and Assess" stage, have been developed since the late 80's and 90's to simulate degradation effects in circuits. During a typical reliability simulation flow, a circuit simulation is first run using fresh device models as usual. This generates the fresh voltage and current waveforms for every transistor, which are further used to generate degraded models for every transistor.
These degraded models are then used for a second pass of simulation, to predict the behaviour of the aged circuit [Liu06].

Initially, tools like HOTRON [Aur87], RELY [She89], and BERT [Tu93] were used to study the degradation effects caused by electro-migration and hot carrier mechanisms. The Berkeley Reliability Tools (BERT) is one of the best-known tools and is still in use today; it is composed of a set of methods developed by Hu et al. at the University of California Berkeley in the early 90's [Tu93]. The toolset can be used to simulate the impact of hot-electron degradation in MOSFETs and bipolar transistors. Furthermore, prediction of circuit failure due to oxide breakdown or electro-migration in CMOS, bipolar and BiCMOS is also supported. The toolset is written around a commercial circuit simulator such as SPICE. Although rather old, a lot of the modelling and simulation elements used in BERT are still applied in modern commercial reliability simulators.

Among the other well-known commercially available reliability simulators is the Mentor Graphics Eldo reliability simulator [Des09]. This tool provides information about circuit performance degradations due to gradual transistor aging effects. Abrupt effects such as dielectric breakdown are therefore not supported. The Cadence reliability simulator, called RelXpert, provides the simulation and analysis of the impact of gradual aging effects such as HCI and NBTI in Virtuoso Ultrasim and in the Analog Design Environment (ADE) [Liu06]. Similarly, this tool cannot be used for abrupt effects such as dielectric breakdown and therefore it has similar capabilities and limitations as the Mentor Graphics Eldo reliability simulator. The Synopsys MOS reliability analysis (MOSRA) is another commercially available reliability simulator that can also be used to see the impact of HCI and BTI effects on integrated circuits [Syn11]. Again, only gradual aging effects are supported in this tool.
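The two-pass reliability simulation flow described at the start of this section (fresh simulation, extraction of stress, generation of degraded models, second simulation) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the algorithm of any of the tools named above: the "circuit" is a single pMOS transistor with a square-law model, the stress is a DC gate voltage rather than a full waveform, and the power-law threshold shift in `degrade` together with its parameters `a` and `n` is a hypothetical aging model chosen only to make the example run.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PmosModel:
    vth: float  # threshold voltage magnitude |Vth| in volts
    k: float    # transconductance parameter in A/V^2


def simulate(model, vsg):
    """Toy 'circuit simulation': square-law drain current of one pMOS."""
    overdrive = max(vsg - model.vth, 0.0)
    return 0.5 * model.k * overdrive ** 2


def degrade(model, vsg_stress, t_seconds, a=3e-3, n=0.2):
    """Hypothetical aging model: dVth = a * Vsg * t^n (assumed, illustrative)."""
    dvth = a * vsg_stress * t_seconds ** n
    return replace(model, vth=model.vth + dvth)


def two_pass_simulation(fresh_model, vsg, t_seconds):
    # Pass 1: simulate with the fresh model; here the stress is simply vsg.
    i_fresh = simulate(fresh_model, vsg)
    # Build the degraded model from the stress seen in the first pass.
    aged_model = degrade(fresh_model, vsg, t_seconds)
    # Pass 2: simulate again with the degraded model.
    i_aged = simulate(aged_model, vsg)
    return i_fresh, i_aged


fresh = PmosModel(vth=0.4, k=1e-3)
i_fresh, i_aged = two_pass_simulation(fresh, vsg=1.0, t_seconds=1e8)
print(f"fresh current: {i_fresh:.3e} A, aged current: {i_aged:.3e} A")
```

With these assumed parameters the aged drain current comes out lower than the fresh one, because the threshold voltage magnitude has increased. A real reliability simulator repeats this per transistor, using the simulated voltage and current waveforms as the stress input.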
The next two sections give some examples from previous research work in which the effect of these degradation mechanisms on an individual device and the corresponding circuits is either analysed or a solution to mitigate their effects is discussed.

#### 2.6.2.4 DEGRADATION ANALYSIS EXAMPLES

In this section some previous degradation analysis research work is discussed. For example, the impact of NBTI and HCI degradation on the reliability of MOSFETs is thoroughly investigated in [The00], whereas the impact of NBTI-induced degradation mismatch in operational amplifier circuits is discussed in [Ago04]. Similarly, circuit-level aging simulations under the NBTI effect on current mirror, operational amplifier, comparator and digital-to-analog converter (DAC) circuits are discussed in [Jha05]. NBTI and HCI degradation effects on the performance of RF and analog circuits are discussed in [Fer09, Wan07a, Hua07, Rub05, Sch03]. The influence of BTI and process variability on the functionality of different configurations of an amplifier circuit is investigated in [Mar09]. The research work conducted in all of the above examples only gives reliability information under the specified stress conditions for a selected duration of time. Moreover, in all of these examples only the degradation models of devices are used to simulate the degraded or lifetime behaviours to establish circuit reliability as a function of time. However, complete circuit dependability (reliability, maintainability, availability etc.) information cannot be extracted from these simulation results.

#### 2.6.2.5 DEGRADATION MITIGATION EXAMPLES

Although most of the research efforts have been directed at analysing the effect of degradation mechanisms, very limited research has been carried out on countermeasures to compensate and overcome the performance degradation in analog and mixed-signal systems.
For example, in [Kri03] an optimum operating voltage to balance NBTI degradation against transistor voltage headroom is presented. A number of passive techniques, like burn-in and calibration, are used in [Cho11a, Cho11b] to compensate the aging-induced offset in different structures. Body biasing and differential matching to actively compensate the NBTI effect are presented in [Gho10]. A trimming method to reduce the offset voltage of an SRAM sense amplifier is given in [Kaw10]. A boosted gain programmable OpAmp is presented in [Wan11b] in order to mitigate the aging effects due to NBTI. In [Mak10] the use of elevated $V_{DD}$ and thick oxide transistors to boost the performance of analog and mixed-signal circuits is discussed.

#### 2.6.3 SHORTCOMINGS AND LIMITATIONS OF CURRENT PRACTICES

In this section, the shortcomings and limitations of the above-mentioned methods are briefly described. The important thing to note here is that only analog and mixed-signal systems are considered in this research work. Therefore, digital systems are not a part of this discussion.

- 1) The first and most important limitation is the inability to consider all attributes of dependability, or at least the most important ones, in these practised methods. Mainly reliability has been considered, both in device-level and design-level analysis and mitigation.
- 2) Most efforts have been made to analyse and mitigate degradation in individual devices or small individual circuits like a current mirror, operational amplifier, comparator etc. However, complete systems like analog and mixed-signal front-ends composed of operational amplifiers and analog-to-digital converters (the present case) are rarely considered.
- 3) The proposed reliability (only one attribute of dependability) mitigation techniques are mostly considered for very specific designs with very specific degradation problems. No generic solutions are presented.
- 4) Not all of the degradation issues are considered simultaneously. The failure of any circuit or system as a result of all of the degradation mechanisms and the corresponding mitigation techniques, especially related to availability (another attribute of dependability) of the system, are rarely considered.
- 5) Reliability analysis and corresponding solutions of an analog-to-digital converter (ADC) are only partially considered [Bao09].
- 6) If reliability is considered for any device or circuit then it is normally considered, analysed and resolved at the design stage. The reliability issues and corresponding solutions during their operational life are never considered in the current practices.
- 7) The reliability analysis tools are usually very slow [Che14] and sometimes it is very difficult to simulate larger and complex designs. Therefore, only small circuits are normally considered and analysed in these simulations [Jha05].
- 8) Some of the robust techniques practiced for small systems like amplifiers and comparators are becoming too expensive in terms of area, design time, design complexity and non-generic issues [Wan11b].
- 9) Usually, these reliability simulation results are based upon worst-case device stress tests that might be completely different from real environment stress conditions [Pau06]. Therefore, the lifetime estimations and other extracted results might not be realistic.
- 10) Two important stress variables, the working-stress voltage and the working-stress temperature, might change for a short time duration. Therefore, the corresponding changes will be completely different from the changes resulting from continuous stress conditions. This differentiation is not considered in current degradation analysis and mitigation practices.
- 11) Fabrication-related process-induced degradations are being considered in recently practiced methods; however, process-induced initial-variation dependent degradations are not considered.
The next section will briefly discuss which of the issues described above are considered or not considered during the current research work presented in this thesis.

#### 2.6.4 Considerations in the Current Research

In this section, the above issues will be discussed one by one to see to what extent they have been considered in our research work.

- 1) The following important attributes of dependability, being reliability, availability and maintainability, are considered in this research work.
- 2) A general analog and mixed-signal front end composed of an amplifier and an analog-to-digital converter is considered (Chapter 3).
- 3) Optimized generic solutions at design level and during operational life are being investigated (Chapter 3).
- 4) Only NBTI-related degradation issues have been the main focus of the current research. Other degradation mechanisms are not considered in this research work.
- 5) Degradation issues in the static and dynamic parameters of an ADC are considered. However, only degradations in selected parts of the ADC, specific to a particular architecture, have been considered (Chapter 6).
- 6) The focus of the current research work has been to present methods and techniques that can potentially be used to improve dependability during the operational life of analog/mixed-signal front ends.
- 7) A faster degradation analysis technique, based on behavioural models, is presented for complex and larger systems (Chapter 6).
- 8) Digitally-assisted analog and mixed-signal systems are proposed instead of robust systems. However, area overhead, complexity and implementation problems still remain an issue.
- 9) Dynamic stress conditions and corresponding degradation effects are usually considered in the proposed strategies to have realistic results.
- 10) Special attention has been paid to differentiate and mitigate short-term and long-term degradation effects due to changing working-stress voltages and working-stress temperatures (Chapter 5).
- 11) Fabrication-related process-induced initial-value dependent degradation effects have been considered and resolved at system level (Chapter 4).

#### 2.7 CONCLUSIONS

In this chapter the classical definition of dependability, one of the important design parameters for safety-critical systems, has been introduced and explained as an umbrella of essential system properties like availability, reliability, maintainability, safety, integrity and confidentiality. The main causes that can be potential threats to the dependability of an electronic system and the possible ways to overcome those threats are also briefly discussed. The mathematical formulation and the corresponding theory behind the important dependability attributes like reliability, maintainability, and availability are briefly introduced. The physical mechanisms that are responsible for degradation and failures in devices and associated systems are also concisely described. The state of the art in improving dependability (mainly reliability), its limitations and shortcomings, along with the important considerations taken in the current research work, are also described. It forms the basis of the research in the following chapters.

#### 2.8 REFERENCES

[Aba03] W. Abadeer and W. Ellis, "Behavior of NBTI under AC Dynamic Circuit Conditions," in Proc. IEEE Int. Reliability Physics Symposium (IRPS), pp. 17-22, 2003.

[Ago04] M. Agostinelli, et al., "PMOS NBTI-induced circuit mismatch in advanced technologies," in Proc. IEEE Int. Reliability Physics Symposium, pp. 171-175, 2004.

[Aur87] S. Aur, D.E. Hocevar, and B.S.P. Yang, "Circuit hot electron effect simulation," in Int. Electron Devices Meeting, Vol. 33, pp. 498-501, 1987.

[Avi01] A. Avizienis, J.-C. Laprie, and B. Randell, "Fundamental concepts of dependability," in Laboratory for Analysis and Architecture of Systems (LAAS-CNRS) Technical Report no. 01-145, Apr. 2001.

[Avi04] A. Avizienis, J.-C.
Laprie, B. Randell, and C. Landwehr, "Basic Concepts and Taxonomy of Dependable and Secure Computing," in IEEE Trans. Dependable and Secure Computing, Vol. 1, No. 1, pp. 11-33, 2004.

[Bao09] Y. Baoguang, F. Qingguo, J.B. Bernstein, Q. Jin, and D. Jun, "Reliability Simulation and Circuit-Failure Analysis in Analog and Mixed-Signal Applications," in IEEE Trans. Device and Materials Reliability, Vol. 9, No. 3, pp. 339-347, 2009.

[Buj06] G. Buja and R. Menis, "Conceptual frameworks for dependability and safety of a system," in Proc. IEEE Int. Symp. Power Electronics, Electrical Drives, Automation and Motion, pp. 44-49, May 2006.

[Che03] G. Chen, et al., "Dynamic NBTI of PMOS Transistors and its Impact on Device Lifetime," in Proc. IEEE Int. Reliability Physics Symposium (IRPS), pp. 196-202, 2003.

[Che14] J. Chen, S. Wang, and M. Tehranipoor, "Critical-reliability path identification and delay analysis," in ACM Journal on Emerging Technologies in Computing Systems (JETC), Vol. 10, No. 2, pp. 12:1-12:21, 2014.

[Cho11a] F. Chouard, S. More, M. Fulde, and D. Schmitt-Landsiedel, "An Analog Perspective on Device Reliability in 32nm High-k Metal Gate Technology," in IEEE Int. Symp. Design and Diagnostics of Electronic Circuits Systems (DDECS), pp. 65–70, 2011.

[Cho11b] F. Chouard, M. Fulde, and D. Schmitt-Landsiedel, "An Aging Suppression and Calibration Approach for Differential Amplifiers in Advanced CMOS Technologies," in Proc. IEEE European Solid-State Circuits Conference (ESSCIRC), pp. 251-254, 2011.

[Des09] C. Desclèves, M. Hagan, and W. Wang, "Addressing IC-Reliability Issues," Mentor Graphics Datasheet, 2009.

[Ent07] R. Entner, "Modeling and Simulation of Negative Bias Temperature Instability," Ph.D. dissertation, Technischen Universität Wien, 2007.

[Eus08] I. Eusgeld, F.C. Freiling, and R. Reussner, "Dependability Metrics," Springer Publishing Press, ISBN 978-3-540-68946-1, 2008.

[Fer09] P. Ferreira, H. Petit, and J.-F.
Naviner, "CMOS 65 nm Wideband LNA Reliability Estimation," in Joint IEEE North-East Workshop on Circuits and Systems and TAISA Conference (NEWCAS-TAISA), pp. 1-4, 2009.

[Gho10] A. Ghosh, R. Franklin, and R. Brown, "Analog Circuit Design Methodologies to Improve Negative-Bias Temperature Instability Degradation," in Int. Conf. VLSI Design (VLSID), pp. 369-374, 2010.

[Gra07] T. Grasser, et al., "Simultaneous Extraction of Recoverable and Permanent Components Contributing to Bias-Temperature Instability," in IEEE Int. Electron Devices Meeting (IEDM), pp. 801-804, 2007.

[Hon09] L. Hong, W. Yu, L. Rong, and Y. Huazhong, "Software tools for analysing NBTI-induced digital circuit degradation," in Journal of Electronics (China), Vol. 26, No. 5, pp. 715-719, 2009.

[Hua07] V. Huard, et al., "Design-in-Reliability Approach for NBTI and Hot-Carrier Degradations in Advanced Nodes," in IEEE Trans. Device and Materials Reliability, Vol. 7, No. 4, pp. 558-570, 2007.

[Jha05] N.K. Jha, P.S. Reddy, D.K. Sharma, and V.R. Rao, "NBTI Degradation and Its Impact for Analog Circuit Reliability," in IEEE Trans. Electron Devices, Vol. 52, No. 12, pp. 2609-2615, 2005.

[Kan11] N. Kanekawa, E.H. Ibe, T. Suga, and Y. Uematsu, "Dependability in Electronic Systems," Springer Publishing Press, ISBN 978-1-4419-6714-5, 2011.

[Kaw10] A. Kawasumi, et al., "A Low-Supply-Voltage-Operation SRAM With HCI Trimmed Sense Amplifiers," in IEEE Journal Solid-State Circuits, Vol. 45, No. 11, pp. 2341-2347, 2010.

[Kri03] A. Krishnan, et al., "NBTI Impact on Transistor and Circuit: Models, Mechanisms and Scaling Effects [MOSFETs]," in IEEE Int. Tech. Digest Electron Devices Meeting (IEDM), pp. 14.5.1-14.5.4, 2003.

[Lap85] J.-C. Laprie, "Dependable Computing and Fault Tolerance: Concepts and Terminology," in IEEE Int. Symp.
Fault-Tolerant Computing, pp. 2-11, 1985.

[Liu06] Z. Liu, B.W. McGaughy, and J.Z. Ma, "Design tools for reliability analysis," in IEEE Design Automation Conference (DAC), pp. 182-187, 2006.

[Mak10] P. Mak and R. Martins, "High-Mixed-Voltage RF and Analog CMOS Circuits Come of Age," in IEEE Circuits and Systems Magazine, Vol. 10, No. 4, pp. 27-39, 2010.

[Mar09] J. Martin-Martinez, R. Rodriguez, M. Nafria, and X. Aymerich, "Time-Dependent Variability Related to BTI Effects in MOSFETs: Impact on CMOS Differential Amplifiers," in IEEE Trans. Device and Materials Reliability (TDMR), Vol. 9, No. 2, pp. 305-310, 2009.

[Mar13] E. Maricau and G. Gielen, "Analog IC Reliability in Nanometer CMOS," Springer Publishing Press, ISBN 978-1-4614-6162-3, 2013.

[Nig06] T. Nigam and E.B. Harris, "Lifetime Enhancement under High Frequency NBTI measured on Ring Oscillators," in Proc. IEEE Int. Reliability Physics Symposium, pp. 289-293, 2006.

[Par88] B. Parhami, "From defects to failures: a view of dependable computing," in Journal ACM SIGARCH Computer Architecture News, Vol. 16, No. 4, pp. 157-168, 1988.

[Par96] H.S. Park, et al., "Impact of Profiled LDD Structure on Hot Carrier Degradation of nMOSFET's," in IEEE Int. European Solid State Device Research Conference (ESSDERC), pp. 991-994, 1996.

[Pau06] B.C. Paul, K. Kang, H. Kufluoglu, M.A. Alam, and K. Roy, "Temporal Performance Degradation under NBTI: Estimation and Design for Improved Reliability of Nanoscale Circuits," in IEEE Proc. Design, Automation & Test in Europe (DATE), pp. 1-6, 2006.

[Pra95] D. Prasad, J. McDermid, and I. Wand, "Dependability terminology: similarities and differences," in IEEE Conf. Computer Assurance, pp. 213-221, 1995.

[Rei10] H. Reisinger, T. Grasser, K. Hofmann, W. Gustin, and C. Schlunder, "The Impact of Recovery on BTI Reliability Assessments," in IEEE Int. Integrated Reliability Workshop (IRW), pp. 12-16, 2010.
[Rel14a] Design for reliability: Overview of the process and applicable techniques, Jan 2014, http://www.reliasoft.com/newsletter/v8i2/reliability.htm

[Rel14b] Maintainability theory on the reliability analytics blog, Jan 2014, http://www.reliabilityanalytics.com/blog/2011/09/03/maintainability-theory/

[Rub05] M. Ruberto, et al., "Consideration of age degradation in the RF performance of CMOS radio chips for high volume manufacturing," in IEEE Symp. Radio Frequency Integrated Circuits (RFIC), pp. 549-552, 2005.

[Sch03] C. Schlunder, et al., "On the Degradation of p-MOSFETs in Analog and RF Circuits under Inhomogeneous Negative Bias Temperature Stress," in Proc. IEEE Int. Reliability Physics Symposium (IRPS), pp. 5-10, 2003.

[Sco91] A. Scorzoni, B. Neri, C. Caprile, and F. Fantini, "Electromigration in thin-film interconnection lines: Models, methods and results," in Materials Science Reports, Vol. 7, pp. 143-220, 1991.

[She89] B.J. Sheu, W.-J. Hsu, and B.W. Lee, "An integrated-circuit reliability simulator-RELY," in IEEE Journal Solid-State Circuits, Vol. 24, No. 2, pp. 473-477, 1989.

[Sta01] J.H. Stathis, "Physical and predictive models of ultrathin oxide reliability in CMOS devices and circuits," in IEEE Trans. Device and Materials Reliability, Vol. 1, No. 1, pp. 43-59, 2001.

[Str09] A.W. Strong, et al., "Reliability Wearout Mechanisms in Advanced CMOS Technologies," Wiley-IEEE Press, ISBN: 978-0471731726, 2009.

[Syn11] Estimating Circuit Lifetime, Synopsys Technology Update Issue No. 3, 2011, http://www.synopsys.com/Company/Publications/SynopsysInsight/Pages/Art7-circuitlifetime-IssQ3-11.aspx?cmp=Insight-I3-2011-Art7

[The00] R. Thewes, et al., "MOS Transistor Reliability under Analog Operation," in Microelectronics Reliability (Reliability of Electron Devices, Failure Physics and Analysis), Vol. 40, pp. 1545-1554, 2000.

[Tu93] R.H. Tu, et al., "Berkeley reliability tools-BERT," in IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, Vol.
12, No. 10, pp. 1524-1534, 1993.

[Wan07a] W. Wang, et al., "Compact Modeling and Simulation of Circuit Reliability for 65-nm CMOS Technology," in IEEE Trans. Device and Materials Reliability, Vol. 7, No. 4, pp. 509-517, 2007.

[Wan11b] J. Wan and H.G. Kerkhoff, "Boosted gain programmable OpAmp with embedded gain monitor for dependable SoCs," in IEEE Int. SoC Design Conference (ISOCC), pp. 294-297, 2011.

[Wik14a] Mean time between failures on Wikipedia the free encyclopaedia, Jan 2014, http://en.wikipedia.org/wiki/Mean\_time\_between\_failures

[Wik14b] Mean time to repair on Wikipedia the free encyclopaedia, Jan 2014, http://en.wikipedia.org/wiki/Mean\_time\_to\_repair

[Wik14c] Hot carrier injection on Wikipedia the free encyclopaedia, Jan 2014, http://en.wikipedia.org/wiki/Hot-carrier\_injection

# DEPENDABILITY ANALYSIS AND ENHANCEMENT FOR MIXED-SIGNAL SOCS

ABSTRACT: With increasing system complexity and shrinking technology dimensions, the dependability of critical systems has become a crucial property. Analog and mixed-signal systems, especially front/back-ends, being an integral part of these critical systems, require dependability enhancement strategies for their long-term functionality. This chapter describes the details of a proposed hardware platform for enhancing the dependability of an analog and mixed-signal front-end. The Markov analysis methodology has been used to theoretically investigate and analyze the dependability improvement of the proposed strategy, which is compared against the conventional strategy. Further analysis of the proposed strategy shows that dependability issues and different enhancement strategies must be considered starting from the design phase up to the end-of-life of the system. Initially, this is achieved by optimally selecting different IPs in a system.
This is based on the required system dependability attributes and the choice of an efficient hardware architecture platform to address the dependability issues at the design stage and during the lifetime of the system respectively. The proposed approach basically links device-level simulations to a library of IPs having the same functionalities but different values for dependability attributes along with their speed, area and power overheads. This library is then further used at the design phase for the optimal selection of IPs according to the required dependability, speed, area, and power requirements and finally integrated with the proposed efficient hardware platform. The presented approach is verified by system-level behavioural simulations and this provides confidence in using the proposed strategy.

#### 3.1 Introduction

As discussed in Chapter 2, in moving designs from older technology nodes to nanoscale technology nodes, different degradation mechanisms like Negative Bias Temperature Instability (NBTI), Positive Bias Temperature Instability (PBTI), Hot Carrier Injection (HCI), and Time Dependent Dielectric Breakdown (TDDB) have become more important. These degradation mechanisms have been frequently studied for digital systems but they also play an important role in analog and mixed-signal system performance degradations [Bao09, Jha05, Cha10, Cha07]. Many analog operations require matched parameters and therefore any mismatch introduced by these degradation mechanisms can result in analog system performance degradations. The performance degradation will further result in system failure if it progresses beyond the system specifications. Therefore, the dependability of these electronic systems requires that the performance parameters remain within system specifications. Analog and mixed-signal front/back-ends, being an important part of most critical systems, especially in safety-critical (e.g.
automotive, medical) and mission-critical (e.g. space, defence) systems, have received little attention with regard to dependability. Studies have shown that with increasing complexity of systems and shrinking technology dimensions the dependability of these electronic systems has become increasingly crucial [Ker10]. These studies also lead to the conclusion that, in the same way as low power, low noise, high resolution, and high speed have become important for many applications, dependability is also becoming an important design axis for these applications. Therefore, the focus of our research work, and of this chapter in particular, is on the dependability of analog and mixed-signal systems, especially the front-ends.

Usually, the dependability issues of most electronic systems are tackled only partially. Reliability, a single important attribute of dependability, has received most of the attention, and normally only at the design-stage simulation level. In these reliability simulations, degradation models are extracted and used for simulating the lifetime behaviour of an IP. This information is then used to redesign the IP or to incorporate circuit strategies for improving reliability [Boa09]. However, this approach does not deal with *system* dependability issues related to complete IP failures, nor with *system* availability and maintainability issues. Therefore, further consideration is required to deal with dependability issues during the operational life. These issues and their solutions are the topic of this chapter.

This chapter has been divided into ten sections. Section 3.2 briefly summarizes the important selected attributes of dependability, their impairments, and means of enhancement in case of degradations. In order to analyse the dependability of the proposed hardware platform, presented in section 3.6, section 3.3 gives the necessary details on how to model and analyse the dependability of a system.
Since the research work in this thesis aims to improve the dependability of analog and mixed-signal systems, the important issues in the dependability of analog and mixed-signal IPs are discussed in section 3.4. Section 3.5 briefly presents the digitally-assisted analog and mixed-signal IPs that are used in building the dependable hardware platform discussed in section 3.6. This proposed hardware platform is analysed for its dependability improvements in section 3.7, and the corresponding implementation issues are discussed in section 3.8. Based on these implementation issues, a possible solution in terms of an improved strategy is discussed in section 3.9. The simulation results of a target system based on this improved strategy and the conclusions are discussed in sections 3.10 and 3.11 respectively.

#### 3.2 DEPENDABILITY ENHANCEMENT

As discussed in the previous chapter, the dependability of a system is defined as its trustworthiness: the confidence that, in a given environment, the system will operate as expected and will not fail during its normal operation [Buj04]. Therefore, the ultimate goal of a dependable system is to deliver a service that can justifiably be trusted. The service delivered by a system is its behaviour as perceived by its user(s). Here a user could be either another physical system or a human that interacts with the system.

As mentioned in the previous chapter, the term dependability represents the property of the system that integrates attributes like availability, reliability, maintainability, safety, integrity and confidentiality [Avi01, Avi04, Buj06]. Therefore, addressing dependability issues means addressing its individual attributes. In other words, enhancing the dependability of a system requires an enhancement of each individual dependability attribute, and therefore any proposed strategy for dependability enhancement must take care of *all* of its attributes.
In order to enhance the system dependability the following questions have to be answered:

1) what is the original dependability of the system?
2) what are the impairments of the system dependability?
3) how can these impairments be removed or reduced?

Finally, a comparison between the original and the improved dependability can be made to determine the improvement factor. On the other hand, a comparison between the conventional strategy and the proposed strategy to enhance the dependability of the system can also be made.

Usually, the impairments of dependability are defined in terms of faults and failures as discussed in the previous chapter. Dependability enhancement means include fault prevention, fault removal or fault forecasting techniques as discussed in Chapter 2. The next sections will briefly discuss why some of the attributes of dependability have been selected as the important attributes for analog and mixed-signal systems (front-ends) and how these attributes can be enhanced.

#### 3.2.1 SELECTED DEPENDABILITY ATTRIBUTES

The dependability of a system, based on its application, is represented by a collection of essential dependability attributes to express the dependability properties which are expected from a system. Analog and mixed-signal systems (front-ends) being an important part of most of the critical systems are the focus of our research. The dependability of these analog and mixed-signal systems requires that they should always be functioning correctly and should be maintained/repaired with minimum downtime in case of any failure. Among the different attributes of dependability the reliability, maintainability, and availability are the focus of this chapter and the whole thesis. By improving these dependability attributes, the other attributes like safety will also be partially tackled. However, the attributes like integrity and confidentiality will be slightly irrelevant and less important in this case.
Therefore, these attributes will not be considered in the research work presented in this thesis. Basically, the selected attributes are probabilistic quantities and can be easily related to quantities like mean-time-to-failure (MTTF), mean-time-between-failures (MTBF), and mean-time-to-repair (MTTR) [Buj04, Avi04, Ker11].

#### 3.2.2 RELIABILITY ENHANCEMENT

The reliability, being the probability as a function of time that the system will be functioning correctly as intended at that time, can be related to its failure-free states. The higher the number of failure-free states of a system, the higher the reliability of the system will be. Therefore, one possible way of enhancing the reliability of a system is to increase the number of failure-free states.

The failure of a system is the inability of the system to perform its required functions within the specified performance requirements (specifications). The performance of a system can be retained within its specifications in a number of ways. One possible way is to provide trimming or tuning options for different performance parameters, so that in case of performance deviations these options can be used to bring the system back within specifications. Another possibility is to provide redundancy (hardware redundancy in the present case) in the system. In case of performance deviations, redundant (fault-free) hardware can be activated to replace the faulty hardware. Therefore, by providing trimming, tuning or redundancy options, the system performance can be retained within specifications, system failures can be minimized, and hence the system reliability can be enhanced. Mathematically, for repairable systems, reliability can be related to the MTBF of the system. Therefore, by increasing the MTBF the reliability of the system will be enhanced.
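As a hedged illustration of the link between failure rate and reliability, assume a constant failure rate $\lambda$. This exponential assumption is not stated at this point in the text; it is the one adopted later for the Markov analysis in section 3.3.1.2. Under that assumption:

$$R(t) = e^{-\lambda t}, \qquad \int_0^{\infty} R(t)\,dt = \frac{1}{\lambda}$$

so the mean time to the next failure equals $1/\lambda$. For instance, $\lambda = 1/1000$ failures per hour gives a mean of 1000 hours and $R(1000) = e^{-1} \approx 0.37$; halving $\lambda$ doubles the mean and raises $R(1000)$ to $e^{-0.5} \approx 0.61$.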
#### 3.2.3 MAINTAINABILITY ENHANCEMENT

The maintainability, being the probability as a function of time that the system can be repaired if it fails to perform its correct function as intended at that time, can be related to the ease and speed with which the repairing mechanism repairs or restores the system following a failure. As stated above, the failures of a system are related to its performance, and in order to decrease failures the performance should remain within the specifications. Therefore, as a requirement to improve maintainability, the repairing mechanism should be capable of restoring the required system performance to within its specifications, and as quickly as possible. This can be established by providing enough repairing, tuning or redundancy options. By increasing the repair rate, defined as the number of repairs that can be performed per unit time, the speed of the repairing mechanism will increase, and hence the system maintainability can be enhanced. Mathematically, for repairable systems, the speed of the repairing mechanism can be related to the Mean Time to Repair (MTTR). Therefore, by decreasing the MTTR the maintainability of the system can be increased.

#### 3.2.4 AVAILABILITY ENHANCEMENT

The availability, being the probability as a function of time that the system will be available for correct service at that time, can be related to the failure-free states of the system. For a failure-free state the system should be neither in the failure state nor in the repairing state. Therefore, as a requirement to improve availability, the performance, being related to failures as stated above, should be kept within its specification boundaries and the time required to repair a system should be minimized.
This can be achieved either by providing fault-tolerance and redundancy options or by anticipating the possible repairing options in advance in order to reduce the repair time, and hence the system availability can be enhanced. Usually, availability is considered for repairable systems and can be expressed as:

$$A = \frac{MTBF}{MTBF + MTTR} \tag{3.1}$$

This shows that the availability of the system can be enhanced by increasing *MTBF* or by decreasing *MTTR*. In other words, by increasing reliability and maintainability the availability of the system will also increase.

#### 3.3 DEPENDABILITY ANALYSIS

In order to evaluate the efficiency of any proposed approach for dependability enhancement, it is important to analyse the improvement in the individual attributes of dependability. Typically, there are two conventional approaches to analyse the dependability of a system. The first approach deals with the development and evaluation of models, and the second approach deals with the actual testing of the system. In the first approach, models based on the failure rates of the components, available from handbooks or manufacturers' data, are used to analyse the dependability, usually the reliability, of the overall system. In the second approach, real test data is used, which might be very costly and may require lengthy test procedures for large and complex systems. Therefore, in this chapter the first approach will be used to analyse the efficiency of our proposed approach for dependability enhancement. The focus will be on analysing the improvement in the selected attributes of dependability: the reliability, maintainability, and availability of the system. This can be achieved by using, for example, reliability block diagrams and Markov processes, as discussed in the following sections.

#### 3.3.1 DEPENDABILITY MODELLING

The dependability modelling of any system can be classified into two classes: combinatorial modelling and stochastic modelling [Mat14].
The combinatorial modelling techniques include reliability block diagrams, fault trees, and reliability graphs. The stochastic techniques, in contrast, are based on the state-space and time-space characteristics and can be classified into discrete Markov models and continuous Markov models. The focus of the current chapter will be on reliability block diagrams and Markov models because of their easy-to-use modelling approach, their ability to include repair/replace mechanisms, and their modest computational requirements. Their accuracy depends on the accuracy of the information (failure rate, repair rate, reliability of sub-blocks etc.) used in these modelling techniques. The next sections will briefly describe the necessary background details, which are further used to study the dependability enhancement of the proposed strategy in the later sections.

## 3.3.1.1 RELIABILITY BLOCK DIAGRAMS (RBDS)

Reliability block diagrams (RBDs) are one of the most commonly used models for reliability analysis because of their simplicity and ease of use in modelling complex systems, especially those with redundancy [Mat14]. RBDs present an abstract view of the whole system where each component/IP is represented by a separate block. The operational dependency among the blocks is represented by the interconnections, which can be serial or parallel depending on the overall operation of the system. In order to analyse the reliability of a complex system, the whole system is first partitioned into serial and parallel sub-systems, and then the RBDs of these individual sub-systems are combined into one for the overall reliability of the whole system.

The reliability of a serial system, where (uncorrelated) sub-systems are connected in series, requires that all of the sub-systems should be operational; a failure in one sub-system will result in system failure.
Therefore, the reliability of a serial system is given by [Mat14]:

$$R_{serial}(t) = \prod_{i=1}^{n} R_i(t) \tag{3.2}$$

where $R_i(t)$ is the reliability of the $i^{th}$ component in series, for $i \in \{1,2,3,...,n\}$.

Similarly, for a parallel system, only one sub-system is required for the system to be operational. Therefore, the reliability of a parallel system is given by [Mat14]:

$$R_{parallel}(t) = 1 - \prod_{i=1}^{m} (1 - R_i(t)) \tag{3.3}$$

where $R_i(t)$ is the reliability of the $i^{th}$ component in parallel, for $i \in \{1,2,3,...,m\}$.

Different IPs (components) on a single chip will be uncorrelated and their individual reliabilities will be independent of each other, although they will be working in a similar environment. Their individual reliability will depend on their individual architecture. Therefore, equations (3.2) and (3.3) are also valid for serial and parallel components (IPs) on a chip and are further used in section 3.7 to evaluate the reliability of the proposed strategy.

#### 3.3.1.2 MARKOV ANALYSIS

Markov analysis has been used for decades to analyse the dependability (reliability/availability) of fault-tolerant systems with non-repairable as well as repairable components. It is normally used because of its easy-to-use modelling approach, its ability to include repair/replace mechanisms, and its modest computational requirements. Markov analysis deals with a special class of stochastic processes that was formally described by the Russian mathematician A.A. Markov (1856-1922).

Markov analysis starts with formulating a Markov state model by breaking the whole system into a number of states represented by circles (bubbles). These circles are then connected by directional arcs representing the transition rates (in our case failure and repair rates; failures per hour and repairs per hour) at which the system moves from one state to another state.
In general, these transition rates can be time varying, allowing the Markov state model to represent a variety of different densities for the time spent in a state before the system moves to the next state [Buk06]. Solution of such a state model using the state-space approach then predicts the probability that the system will be in the various states after any specified time interval. The sum of these probabilities over the non-failure states then yields the system reliability [Kum09].

Figure 3.1: Block diagram of a conventional analog and mixed-signal front-end

The basic assumption in Markov analysis is that the behaviour of the system in each state depends only on the present state of the system and not on the previous state or the time at which it reached the present state. In dependability engineering, this assumption is satisfied if all events (failures, repairs, etc.) in each state occur with constant occurrence rates. For a large number of similar systems this occurrence rate can also be considered constant because it will be close to an average value. This means the time spent in each state follows an exponential distribution.

Mathematically, the Markov model is completely described by its transition matrix A(t) where, for every $i \neq j$ (i = row, j = column), the $ij^{th}$ entry represents the transition rate from state 'i' to state 'j', and, for every i = j, the $ij^{th}$ entry is minus the sum of the other entries in the $i^{th}$ row. The diagonal entries are therefore such that each row of A(t) sums to zero (by definition). The behaviour of the Markov state model is then governed by the following differential equation [Buk06]:

$$\frac{dP(t)}{dt} = P(t) * A(t) \tag{3.4}$$

where P(t) is a $1 \times n$ row vector, A(t) is an $n \times n$ matrix and n is the number of states in the system.
The solution of equation (3.4), which gives the probability in each state of the system, is given by:

$$P(t) = P(0) * [e^{A(t)*t}] \tag{3.5}$$

where $e^{A(t)*t}$ is an $n \times n$ matrix and P(0) is a $1 \times n$ initial probability row vector describing the initial state of the system. This closed form applies when the transition rates, and hence A(t), are constant in time, which is the case considered in section 3.7. Equation (3.5) is further used in section 3.7 to analyse the dependability of the proposed strategy.

#### 3.4 DEPENDABILITY OF ANALOG AND MIXED-SIGNAL FRONT-ENDS

The dependability of an analog and mixed-signal front-end will depend on the dependability of its constituents. The constituents of a conventional simplified analog and mixed-signal front-end can be considered to be a sensor, an amplifier/filter and an analog-to-digital converter. For simplicity, a temperature sensor can be considered to serve the purpose of a general sensor, followed by an operational amplifier (analog IP) and an ADC (mixed-signal IP) as shown in Figure 3.1. In this way, the output of the temperature sensor will be amplified by the operational amplifier and then converted into a digital value by the ADC for further processing in the digital domain. Therefore, the performance of the operational amplifier and the analog-to-digital converter will affect the performance of the front-end.

As stated above, the dependability attributes, namely the reliability, maintainability and availability of a system, are related to its performance. Therefore, any change in the performance of the operational amplifier (OpAmp) or the analog-to-digital converter (ADC) beyond its specifications could potentially affect the dependability of the system. Since the OpAmp and the ADC are connected in series, improving the dependability of the analog and mixed-signal front-end requires that the dependability of the OpAmp and the ADC be improved individually (the sensor part is not considered here as discussed in Chapter 1, section 1.5).
The next sections will discuss how the dependability of the OpAmp and ADC can be enhanced individually in order to improve the dependability of the whole front-end.

#### 3.5 DIGITALLY-ASSISTED ANALOG AND MIXED-SIGNAL IPS

The concept of digitally-assisted analog and mixed-signal IPs has been introduced, in which digital signals are used to realize, control, improve, and change the circuit functionalities in the analog domain [Xin10]. This is because, as technology scales, the area and power consumption of analog circuits do not scale at the same pace as those of their digital counterparts. Furthermore, digital circuits are cheap and flexible in terms of performance improvements. In addition, these digitally-assisted analog and mixed-signal IPs also provide the opportunity to estimate analog performance based on digital-domain data. This makes them a better choice in terms of observability, controllability, and functional flexibility, at the cost of the associated area overheads, as compared to conventional analog and mixed-signal circuits.

These digitally-assisted analog and mixed-signal IPs can be found in many applications including amplifiers [Mur06], ADCs [Mur06, Sir04, Mur07], RF transceivers [Vas03] and sigma-delta modulators [Ros05]. They provide better dependability [Wan11] and can therefore be used to construct a dependable hardware architecture for dependability improvements. The next section will discuss the use of these digitally-assisted analog and mixed-signal IPs in a possible hardware architecture that can potentially be used to enhance the dependability of analog and mixed-signal front-ends.

#### 3.6 Initial Proposed Hardware Platform

Based on the requirements, discussed before, to enhance the individual attributes of dependability, a hardware platform is proposed for analog and mixed-signal front-ends as shown in Figure 3.2.
This proposal, despite the increased cost and area, is based on the usual concept of hardware redundancy. This has been an established way of providing fault-tolerance in systems, especially in safety/mission-critical systems. In this initial proposed hardware platform (IPHP) each analog and mixed-signal IP has digital assistance. This makes it possible to digitally monitor, re-program or tune different parameters [Mur06, Sir04] of the individual IPs. Therefore, hardware redundancy as well as the digital monitoring and controlling (programming/tuning) capabilities of the different analog and mixed-signal IPs have been utilized in this IPHP in order to achieve dependability improvements.

Among the other important blocks of the IPHP, the "Diagnose & Action" IP is of crucial importance. This IP is responsible for diagnosing the system performance, based on the different system performance parameters, and taking actions accordingly to maintain the dependability requirements of the whole system. The switches $SW_1$ and $SW_2$, which are controlled by the "Digital Processor" IP, are used to switch between the two possible operational modes, namely the diagnosis mode and the normal mode. The switch matrix 'S', composed of switches $SW_{11}$, $SW_{12}$, ..., $SW_{43}$, is used to select the different combinations of analog and mixed-signal IPs to form different possible active paths as shown in Figure 3.3 (1 = switch closed, 0 = switch open). According to the IPHP, each active path consists of one digitally-assisted "Analog IP" and one digitally-assisted "Mixed-Signal IP". The performance of each IP can be checked individually by properly using the switches $SW_1$ and $SW_2$ and the switch matrix 'S'. Each individual IP can be isolated from the rest of the IPs to be individually diagnosed by the "Diagnose & Action" IP. The isolation or bypass mechanism is not shown in Figure 3.2 in order to keep the figure clear for the other switching mechanisms described above.
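The role of the switch matrix 'S' can be illustrated with a small sketch. It is only an illustration under stated assumptions: Figure 3.3 is not reproduced here, so the index convention is inferred from the switch names $SW_{11}$ to $SW_{43}$ and from the example path described in section 3.6.1. Rows 1 and 2 are assumed to sit before and after the analog IPs, rows 3 and 4 before and after the mixed-signal IPs, and the column index is assumed to select one of three IP copies. The names `N_COPIES`, `switch_pattern` and `all_active_paths` are hypothetical.

```python
from itertools import product

# Assumption: three copies of each IP type, inferred from the column
# indices 1..3 in the switch names SW_11 .. SW_43.
N_COPIES = 3


def switch_pattern(analog_ip, mixed_ip):
    """Return a 4x3 matrix (1 = switch closed, 0 = switch open).

    Assumed convention: rows 1-2 select the analog IP copy and
    rows 3-4 select the mixed-signal IP copy (both 1-based).
    """
    selected = [analog_ip, analog_ip, mixed_ip, mixed_ip]
    return [
        [1 if col == chosen else 0 for col in range(1, N_COPIES + 1)]
        for chosen in selected
    ]


def all_active_paths():
    """Enumerate every (analog IP, mixed-signal IP) pair and its switch pattern."""
    copies = range(1, N_COPIES + 1)
    return {(a, m): switch_pattern(a, m) for a, m in product(copies, repeat=2)}


if __name__ == "__main__":
    for (a, m), pattern in all_active_paths().items():
        print(f"Analog IP {a} + Mixed-Signal IP {m}: {pattern}")
```

Under these assumptions every active path closes exactly one switch per row, so choosing the two IPs independently gives nine candidate paths for the "Diagnose & Action" IP to select from.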
### 3.6.1 WORKING PRINCIPLE

The working principle of the IPHP is different from the usual triple modular redundancy (TMR) concept. In a TMR system, all of the redundant modules are active at the same time and a voter is responsible for deciding on the correct behaviour. In the IPHP, by contrast, only one path and its corresponding IPs are active. The rest of the IPs in the other paths are not active.

To fully understand the working principle of the IPHP, let us consider that initially all the analog and mixed-signal IPs are fully dependable [Buj04] and that switches $SW_{11}$, $SW_{21}$, $SW_{31}$, and $SW_{41}$ are closed to form an active path from switch $SW_{11}$ to switch $SW_{41}$ via switches $SW_{21}$ and $SW_{31}$ (Figure 3.2). Therefore, "Analog IP 1" and "Mixed-Signal IP 1" are active and the rest of the analog and mixed-signal IPs are not active. This means the rest of the IPs are powered off, and for simplicity we assume they will remain fully dependable until they are activated.

During the lifetime operation of the system, the "Digital Processor" IP will switch $SW_1$ and $SW_2$ into diagnosis mode at predefined regular intervals of time. The frequency of switching from the normal operational mode to the diagnosis mode will depend on both the application and the degradation profile. At the same time, the "Digital Processor" IP will activate the "Diagnose & Action" IP to diagnose the performance of each analog and mixed-signal IP in the active path. As noted above, each individual IP can be isolated from the rest of the IPs to be individually diagnosed by the "Diagnose & Action" IP. During diagnosis mode, the "Diagnose & Action" IP will first verify the IPs of the current active path (in the present case the path from switch $SW_{11}$ to switch $SW_{41}$ via switches $SW_{21}$ and $SW_{31}$) against the specifications of their performance parameters.
If the performance parameters of each IP in the current active path are within specifications, then the "Diagnose & Action" IP will take no actions. On the other hand, if there are deviations in the performance parameters of any of the IPs in the current active path, it will start taking actions accordingly. This means the performance of the current active path is not within specifications and requires different actions to be brought back within specifications.

Figure 3.2: Initial proposed hardware platform (IPHP) for dependability enhancement

Figure 3.3: Switch matrix 'S' and the possible active paths

### 3.6.2 DEPENDABILITY IMPROVEMENTS

The performance of the current analog and mixed-signal front-end will be maintained by using the following three steps:

1) The performance of each analog and mixed-signal IP in the current active path will be diagnosed and compared against its specifications.
2) In case of performance deviations, the digital programming/tuning capabilities of each analog and mixed-signal IP in the current active path will be used to tune its performance parameters back within specifications.
3) Normally, these digitally-assisted analog and mixed-signal IPs have a limited tuning range [Mur06, Sir04], allowing each IP to be tuned back only up to a specific level. Therefore, if any digitally-assisted analog or mixed-signal IP in the current active path is out of its tuning range, it will be replaced with a new/fresh (redundant/spare) IP.

The purpose of the switch matrix is to provide greater flexibility and tuning space for the "Diagnose & Action" IP in order to maintain the performance of the whole system. Therefore, the basic idea behind this scheme is first to digitally program/tune each IP to maintain the performance requirements of the whole system and, once an IP is out of its tuning range, to replace it with a new/fresh (redundant/spare) IP.
This rerouting (replacement) will not affect the overall behaviour because the performance will remain within specification. However, the speed and accuracy of the monitoring and tuning circuitry will be an important issue and has to be considered by electronic designers.

As stated above, improving dependability means improving its individual attributes. Therefore, the initial proposed hardware platform takes care of the individual attributes of dependability. For example:

- the *reliability*, being the continuity of correct service, is improved by regularly monitoring the performance and by taking programming/tuning actions in case of performance deviations from the required specifications.
- the *maintainability* is improved by providing fast and accurate digital programming/tuning capabilities inside the system for taking repair actions.
- the *availability*, being the readiness for correct service, is also improved by providing digital programming/tuning and hardware redundancy options.

These improvements will be further analysed mathematically and compared against conventional strategies in the next sections.

#### 3.7 ANALYSING DEPENDABILITY OF THE IPHP

As discussed earlier, there are a number of available techniques for analysing the dependability of a system, for example stochastic Petri nets, fault trees, reliability block diagrams and Markov analysis [Mat14]. In this thesis, a combination of Markov analysis and reliability block diagrams will be used to analyse the dependability attributes of the initial proposed hardware platform (IPHP) for analog and mixed-signal front-ends.

#### 3.7.1 FORMULATING THE MARKOV STATE MODEL

In order to use Markov analysis to analyse the dependability of the IPHP, first a Markov state model has to be constructed for each analog and mixed-signal IP. As an example, each analog and mixed-signal IP has two digital tuning knobs (ports), as shown in Figure 3.4, that are used to tune its performance back to its specifications.
This means every analog or mixed-signal IP has four different states, as shown in Table 3.1, where each state corresponds to a unique value of its digital tuning knobs (ports) and represents a fault-free (correctly functioning) state (states 1, 2, 3, 4) of the corresponding IP. Furthermore, to complete the Markov state model for this IP, two more states are introduced: state 'R', which corresponds to a state where an IP is being repaired, and state 'F', which represents a failed state. In a failed state the corresponding IP cannot be further repaired by the digital tuning knobs (ports) but could potentially be replaced by a spare IP to continue its functional role in the whole system. Therefore, the Markov state model of each IP now consists of six states as shown in Figure 3.5. Obviously, the total number of states will increase if a higher number of digital tuning knobs is considered.

Table 3.1: Fault-free states as function of tuning knobs

| $N_1 N_2$ | Fault-Free State |
|-----------|------------------|
| 0 0 | 1 |
| 0 1 | 2 |
| 1 0 | 3 |
| 1 1 | 4 |

Figure 3.4: Block diagram of an analog/mixed-signal IP with two digital tuning knobs (N1, N2)

Figure 3.5: Markov state space model for an analog/mixed-signal IP (1, 2, 3 and 4 are fault-free states)

These states are then connected by arcs representing the transitions from one state to another state. In Markov analysis these transitions are described by probabilities and usually represent the failure and repair probabilities. In Figure 3.5, $\lambda_{ij}$ represents the failure rates and $\mu_{ij}$ represents the repair/replace rates. Here, *failure rate* means the expected number of failures (probability) per unit time and *repair/replace rate* means the expected number of repairs/replacements (probability) per unit time. The subscripts 'i' and 'j' represent the direction of the failures or repairs/replacements occurring between these states.
For example, $\lambda_{1R}$ represents the failure rate from state 1 to state 'R'; similarly, $\mu_{R2}$ represents the repair rate from state 'R' to state 2. Furthermore, unlike states 1, 2, and 3, which are repaired to the next state, state 4 cannot be repaired further. Therefore, in case of a failure, state 4 will directly move to the failed state 'F' as shown in Figure 3.5.

It has been shown in [Buk06] that Markov state models with exponential time densities will give the same results for steady-state probabilities as the more complicated non-exponential time densities. Therefore, instead of using more complex failure-time and repair-time densities, exponential time densities, where both failure rates and repair rates are constant, will be used to study the steady-state probabilities. For example, if $\lambda$ is a constant failure rate and $\mu$ is a constant repair rate, then $\lambda e^{-\lambda t}$ and $\mu e^{-\mu t}$ will be the exponential failure and repair time densities respectively [Mat14].

Table 3.2: Failure rate ($\lambda$) and repair/replacement rate ($\mu$) values

The state transition matrix, as described in section 3.3.1.2, of this Markov state model (Figure 3.5) with constant failure and repair/replace rates is:

$$A = \begin{bmatrix} -\lambda_{1R} & \lambda_{1R} & 0 & 0 & 0 & 0 \\ 0 & -\mu_{R2} - \mu_{R3} - \mu_{R4} - \lambda_{RF} & \mu_{R2} & \mu_{R3} & \mu_{R4} & \lambda_{RF} \\ 0 & \lambda_{2R} & -\lambda_{2R} & 0 & 0 & 0 \\ 0 & \lambda_{3R} & 0 & -\lambda_{3R} & 0 & 0 \\ 0 & \lambda_{4R} & 0 & 0 & -\lambda_{4R} & 0 \\ \mu_{F1} & 0 & 0 & 0 & 0 & -\mu_{F1} \end{bmatrix} \tag{3.6}$$

By using equation (3.6) in equation (3.4) the following set of differential equations is obtained.
$$\begin{aligned}
\frac{dP_{1}(t)}{dt} &= -\lambda_{1R}\,P_{1}(t) + \mu_{F1}\,P_{F}(t) \\
\frac{dP_{R}(t)}{dt} &= \lambda_{1R}\,P_{1}(t) - (\mu_{R2} + \mu_{R3} + \mu_{R4} + \lambda_{RF})\,P_{R}(t) + \lambda_{2R}\,P_{2}(t) + \lambda_{3R}\,P_{3}(t) + \lambda_{4R}\,P_{4}(t) \\
\frac{dP_{2}(t)}{dt} &= \mu_{R2}\,P_{R}(t) - \lambda_{2R}\,P_{2}(t) \\
\frac{dP_{3}(t)}{dt} &= \mu_{R3}\,P_{R}(t) - \lambda_{3R}\,P_{3}(t) \\
\frac{dP_{4}(t)}{dt} &= \mu_{R4}\,P_{R}(t) - \lambda_{4R}\,P_{4}(t) \\
\frac{dP_{F}(t)}{dt} &= \lambda_{RF}\,P_{R}(t) - \mu_{F1}\,P_{F}(t)
\end{aligned}$$ (3.7)

The above set of equations can be solved numerically by using $P(0)=[P_1(0)\ P_R(0)\ P_2(0)\ P_3(0)\ P_4(0)\ P_F(0)]=[1\ 0\ 0\ 0\ 0\ 0]$ as the initial state probabilities and the values of Table 3.2 as the constant failure and repair rates. In Table 3.2 all the values are expressed as failures/repairs per hour. As an example, *every* fault-free state (i.e. 1, 2, 3, and 4) is *assumed* to have a failure rate of 1 failure per one thousand hours (1/1000). Therefore, state 1 will fail once in one thousand hours, while state 2 will fail once per two thousand hours. The reason is that state 2 fails at the same rate as state 1 but becomes active only after state 1 has failed. Therefore, to use Markov analysis, where each state is independent of the previous state, the failure rate for state 2 must be once per two thousand hours (1/2000). Similarly, state 3 and state 4 will fail once in three and four thousand hours respectively. In the same way, once a failure occurs from state 1 to state R, it will be repaired to state 2. This means that the repair rate from state R to state 2 will be once every thousand hours (1/1000) and, to apply Markov analysis, the subsequent repair rates for state 3 and state 4 will be once per two and three thousand hours respectively.

Figure 3.6: Probability in each state of the analog/mixed-signal IP as a function of time ( $\lambda$ =1/1000)
Furthermore, the replacement rate from state F to state 1 will be once in four thousand hours (1/4000).

#### 3.7.1.1 RELIABILITY CALCULATION

The solution of the differential equations (3.7), discussed above, provides the probability of an analog/mixed-signal IP being in each state as a function of time, which is shown in Figure 3.6. This figure shows that the probability of the analog/mixed-signal IP being in state 1 decreases, while the probability that it will be in states 2, 3, 4, R, and F increases with time. The sum of these probabilities over the fault-free states (i.e. 1, 2, 3, and 4) then yields the *reliability* of each analog/mixed-signal IP as a function of time. Similarly, the sum of the probabilities over the states 'R' and 'F' yields the unreliability of each analog/mixed-signal IP as a function of time.

Furthermore, in order to calculate the reliability of the whole system, composed of redundant analog/mixed-signal IPs, Reliability Block Diagrams (RBD) are used. Usually, RBDs are not used to calculate the reliability of a repairable system, but one can use RBDs under the assumption that each repairable block behaves like a non-repairable block and that its reliability is independent of the reliability of the other blocks, as mentioned in Section 3.3.1.1. This assumption can be made valid by assigning $\mu_{\rm F1} = 0$ (i.e. no replacement in Figure 3.5) in the above transition matrix (3.6) and recalculating the reliability of each repairable IP. This gives the total reliability of each analog/mixed-signal IP that has only digital tuning capabilities and no replacement capabilities. In other words, the Markov analysis is first used to calculate the reliability of each analog/mixed-signal IP having digital tuning capabilities, and this calculated reliability is then used in RBDs to incorporate the redundant nature or replacement capabilities of the proposed strategy.
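The numerical solution described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the simulation code used for the figures: it transcribes matrix (3.6) as printed, uses the failure, repair and replacement rates quoted in the text, and integrates $dP/dt = P \cdot A$ with a fixed-step fourth-order Runge-Kutta scheme. The value of $\lambda_{RF}$ is not stated in the text, so the value used here is an assumed placeholder, and all function names are hypothetical.

```python
# State order follows P(0) in the text: [1, R, 2, 3, 4, F]
FAULT_FREE = (0, 2, 3, 4)  # indices of the fault-free states 1, 2, 3, 4


def transition_matrix(mu_f1=1 / 4000, lam_rf=1 / 4000):
    """Return matrix (3.6) as printed, with rates in events per hour.

    lam_rf is an ASSUMED placeholder; its value is not given in the text.
    Setting mu_f1=0 removes the replacement path from state F to state 1.
    """
    lam_1r, lam_2r, lam_3r, lam_4r = 1 / 1000, 1 / 2000, 1 / 3000, 1 / 4000
    mu_r2, mu_r3, mu_r4 = 1 / 1000, 1 / 2000, 1 / 3000
    return [
        [-lam_1r, lam_1r, 0.0, 0.0, 0.0, 0.0],
        [0.0, -(mu_r2 + mu_r3 + mu_r4 + lam_rf), mu_r2, mu_r3, mu_r4, lam_rf],
        [0.0, lam_2r, -lam_2r, 0.0, 0.0, 0.0],
        [0.0, lam_3r, 0.0, -lam_3r, 0.0, 0.0],
        [0.0, lam_4r, 0.0, 0.0, -lam_4r, 0.0],
        [mu_f1, 0.0, 0.0, 0.0, 0.0, -mu_f1],
    ]


def derivative(p, a):
    """Right-hand side of (3.7): dP/dt = P * A for a row vector P."""
    n = len(p)
    return [sum(p[i] * a[i][j] for i in range(n)) for j in range(n)]


def rk4_step(p, a, h):
    """Advance the state probabilities by one Runge-Kutta step of h hours."""
    k1 = derivative(p, a)
    k2 = derivative([x + 0.5 * h * k for x, k in zip(p, k1)], a)
    k3 = derivative([x + 0.5 * h * k for x, k in zip(p, k2)], a)
    k4 = derivative([x + h * k for x, k in zip(p, k3)], a)
    return [
        x + h / 6.0 * (c1 + 2.0 * c2 + 2.0 * c3 + c4)
        for x, c1, c2, c3, c4 in zip(p, k1, k2, k3, k4)
    ]


def state_probabilities(t_end, a, h=1.0):
    """Integrate from P(0) = [1 0 0 0 0 0] up to t_end hours."""
    p = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    for _ in range(int(round(t_end / h))):
        p = rk4_step(p, a, h)
    return p


def reliability(p):
    """Sum of the probabilities over the fault-free states."""
    return sum(p[i] for i in FAULT_FREE)


if __name__ == "__main__":
    for hours in (0, 1000, 5000):
        p = state_probabilities(hours, transition_matrix(mu_f1=0.0))
        print(hours, round(reliability(p), 4))
```

Calling `transition_matrix(mu_f1=0.0)` corresponds to the case without replacement that is used for the RBD calculation, while the default arguments keep the replacement path active.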
The working principle of the proposed platform suggests that it can be considered as composed of two main IPs connected in series, as shown in Figure 3.7, where each main IP is composed of three parallel sub-IPs, each being independent of any other IP.

Figure 3.7: Reliability Block Diagram (RBD) of an analog and mixed-signal front end for the initial proposed hardware platform (IPHP) containing redundant analog and mixed-signal blocks

This independence of each IP from the other IPs can be achieved by considering the influence of each IP on the other IPs while calculating its failure rate. In this way, the dependence of one IP on other IPs shifts from the IP level to the failure-rate level, which can be further used in Markov analysis, as described above, making the IPs independent at the IP level. The reliability of each sub-IP ( $R_i(t)$ ) can be recalculated by using Markov analysis, as stated above. Therefore, by using the principle of RBD, the overall reliability ( $R_p(t)$ ) of each main IP, being composed of three sub-IPs in parallel, can be calculated by using equation (3.3). That is:

$$R_p(t) = 1 - \prod_{i=1}^{n} (1 - R_i(t))$$ (3.8)

Here *n* represents the number of parallel sub-IPs. Similarly, once the reliability of each main IP ( $R_p(t)$ ) is calculated, the reliability of two main IPs connected in series ( $R_s(t)$ ) can be calculated using equation (3.2). That is:

$$R_{s}(t) = \prod_{p=1}^{m} R_{p}(t)$$ (3.9)

Here *m* represents the number of main IPs in series. Using equation (3.9), the overall reliability of the initial proposed hardware platform can be calculated. In order to compare results with another system that does not have digital tuning knobs for its sub-IPs, a triplicate system with the same number of parallel and serial IPs has been considered. Figure 3.8 compares the reliability results of the IPHP having repairable IPs and the triplicate system having non-repairable IPs.
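Equations (3.8) and (3.9) can be illustrated with a short numerical sketch. It assumes, purely for illustration, that all sub-IPs share the same reliability value at the time instant of interest; the value 0.9 and the function names are hypothetical and do not come from the simulations reported here.

```python
from math import prod


def parallel_reliability(sub_ip_reliabilities):
    """Equation (3.8): the block fails only if every parallel sub-IP fails."""
    return 1.0 - prod(1.0 - r for r in sub_ip_reliabilities)


def series_reliability(main_ip_reliabilities):
    """Equation (3.9): the chain works only if every main IP works."""
    return prod(main_ip_reliabilities)


def platform_reliability(r_sub, n_parallel=3, m_series=2):
    """Two main IPs in series, each made of three parallel sub-IPs (Figure 3.7)."""
    r_main = parallel_reliability([r_sub] * n_parallel)
    return series_reliability([r_main] * m_series)


if __name__ == "__main__":
    # Hypothetical sub-IP reliability of 0.9 at some time t
    print(platform_reliability(0.9))
```

With a sub-IP reliability of 0.9, each main IP reaches $1 - 0.1^3 = 0.999$ and the series connection of two main IPs gives $0.999^2 \approx 0.998$.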
This figure shows that the reliability of the whole system with repairable IPs (WRIP) is higher than the reliability of the triplicate system having non-repairable IPs (NRIP). This reliability increase is directly related to the repair rate and the number of available fault-free states (e.g. 2, 3, 4 in the current case). This means that the reliability can be increased further either by increasing the repair rate or by providing more fault-free states. The percentage of reliability improvement is defined as:

$$\%\,\text{Improvement} = \frac{\text{Reliability of the IPHP} - \text{Reliability of TS}}{\text{Reliability of the IPHP}} \times 100$$ (3.10)

where IPHP stands for initial proposed hardware platform and TS stands for triplicate system.

Figure 3.8: Reliability of the initial proposed hardware platform (IPHP) with repairable IPs (WRIP) and a triplicate system (TS) with non-repairable IPs (NRIP) ( $\lambda$ =1/1000).

Figure 3.9: Percentage of reliability improvement of the proposed hardware platform with repairable IPs (WRIP) over a triplicate system with non-repairable IPs (NRIP) ( $\lambda$ =1/1000).

Figure 3.9 shows the percentage of improvement obtained with the proposed strategy as compared to the triplicate system strategy. Initially, the percentage of improvement is low because both strategies have similar reliabilities; however, as time passes the proposed strategy dominates over the triplicate system strategy. This means that when the triplicate system strategy is no longer capable of maintaining the reliability of the system, the proposed strategy will do a better job of maintaining the overall reliability of the system.
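As a small worked example of equation (3.10), the sketch below evaluates the percentage of improvement for two hypothetical reliability values; the numbers are made up for illustration and are not read from Figure 3.8 or Figure 3.9.

```python
def percent_improvement(r_iphp, r_ts):
    """Equation (3.10): relative reliability gain of the IPHP over the TS, in percent."""
    if r_iphp <= 0.0:
        raise ValueError("reliability of the IPHP must be positive")
    return (r_iphp - r_ts) / r_iphp * 100.0


if __name__ == "__main__":
    # Early in life both strategies have similar reliability (hypothetical values)
    print(percent_improvement(0.99, 0.99))
    # Later the triplicate system has degraded further (hypothetical values)
    print(percent_improvement(0.90, 0.60))
```

Equal reliabilities give 0 %, which matches the low initial improvement described above, whereas the second pair of values gives roughly 33 %.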
#### 3.7.1.2 MAINTAINABILITY CALCULATION

The maintainability of the whole system, being the probability that the system is successfully repaired when it fails, can be calculated by estimating the contribution of the repair/tuning mechanism to decreasing the unreliability (1 - reliability) of the non-repairable system; in other words, by calculating the contribution of the repair/tuning mechanism to increasing the reliability of a repairable system as compared to the non-repairable system. Mathematically, this can be accomplished by subtracting the sum of probabilities in state R and state F of a repairable IP system from the probability of failure (unreliability = 1 - reliability) of the system with no repair mechanisms for IPs (i.e. no digital tuning or replacement options).

$$\text{Maintainability} = P_{SIPHP} = P_{FNRIP} - \text{Sum of probabilities in states R and F}$$ (3.11)

where $P_{SIPHP}$ stands for the probability of a successfully repaired initial proposed hardware platform and $P_{FNRIP}$ stands for the probability of failure of the system with no repair mechanisms for IPs. This difference gives the increase in the probability of a fault-free system that is due to the repair mechanisms, which was not possible for non-repairable components with no replacement mechanism. This is called the maintainability, or the probability that the system was successfully repaired (i.e. it is in a fault-free state) when it failed.

Figure 3.10: Maintainability of the initial proposed hardware platform (IPHP) with repairable IPs (WRIP) and a triplicate system (TS) with non-repairable IPs (NRIP) ( $\lambda$ =1/1000)

In order to compare results with another system that does not have digital tuning knobs for its sub-IPs but only has a replacement mechanism, a triplicate system with the same number of parallel and serial IPs has been considered. Figure 3.10 compares the maintainability results of the IPHP having repairable IPs and the triplicate system (TS) having non-repairable IPs.
This shows that the maintainability of the IPHP increases with increasing repair rate, assuming a successful repair. However, the maintainability of the triplicate system becomes zero after some time.

#### 3.7.1.3 AVAILABILITY CALCULATION

The availability of the initial proposed hardware platform (IPHP), being the probability as a function of time that the system will be available for correct service at that time, can also be related to the probabilities in the different states. The probability in states 1, 2, 3, and 4 gives the probability that the system will be functioning correctly (continuity of correct service) and the probability in states R and F gives the probability that the system will not be functioning correctly (failure continuity). Therefore, given the probability of correct functioning and the probability of incorrect functioning (i.e. reliability and unreliability), the availability of the hardware platform can be calculated as the ratio of the probability that the system is functioning correctly to the total probability of functioning correctly and not functioning correctly, as described in equation (3.1). If the probability that the system will be functioning correctly relates to the mean-time-between-failures MTBF (up-time) and the probability that the system will not be functioning correctly relates to the mean-time-to-repair MTTR (down-time), then equation (3.1) can be redefined in terms of probabilities as:

$$A_{probability} = \frac{(P_1 + P_2 + P_3 + P_4)}{(P_1 + P_2 + P_3 + P_4) + (P_R + P_F)}$$ (3.12)

where $P_i$ represents the probability of the system being in state 'i' (e.g. states 1, 2, 3, 4, R, and F).

Figure 3.11: Availability of the initial proposed hardware platform (IPHP) with repairable IPs (WRIP) and a triplicate system (TS) with non-repairable IPs (NRIP) ( $\lambda$ =1/1000).
Since $(P_1 + P_2 + P_3 + P_4) + (P_R + P_F) = 1$ at any particular time, the availability of the IPHP will follow the same pattern as the reliability of the IPHP. Figure 3.11 shows the availability of the IPHP. It shows that by increasing the repair rate, that is, by decreasing the ' $P_R + P_F$ ' term in equation (3.12), the availability of the whole system can be enhanced up to 99.99 per cent. The figure also compares the availability of the proposed strategy with the availability of the triplicate system with non-repairable IPs. Initially, the availability is similar in both cases; however, as time progresses the triplicate system is not capable of maintaining the availability, whereas the proposed strategy can maintain the availability for a longer time.

#### 3.8 POTENTIAL IMPLEMENTATION ISSUES

Although the initial proposed hardware platform (IPHP) seems very similar to a conventional TMR concept, it is a new concept in which digitally-assisted IPs are used with an external (on-chip) test-input signal, a complex switching mechanism, and a "Diagnose and Action" IP. Therefore, the IPHP, along with its on-chip testing and digital repairing mechanism, requires a lot of design effort from the analog and mixed-signal design community. This may challenge electronic designers to cope with the following implementation issues:

- Lowering hardware area and power consumption overheads will be a challenge for electronic designers (proposed potential solution in Chapter 3).
- A single test-input signal that can potentially be used to analyse all of the performance parameters of the system will be difficult to manage. Therefore, more specific test-input signals will be required, which raises the implementation issue of generating all of these test-input signals (proposed potential solution in Chapters 3 and 4).
- Monitoring all of the performance parameters and comparing them against their specifications will require complicated monitoring and comparing circuits (proposed potential solution in Chapter 4).
- The monitoring accuracy and the corresponding digital repairing accuracy will play a significant role in the overall improvement of the system dependability. Therefore, designing monitoring and repairing circuits with the required accuracies will be another challenge for the electronic designers (proposed potential solution in Chapter 3).
- Designing fault-tolerant switches, to avoid single points of failure, with minimum loading, leakage, interconnect path changing time and noise effects will be another challenge (proposed potential solution in Chapter 3).
- Apart from the diagnosing and repairing accuracy, the speed required to diagnose the performance deviation and carry out the corresponding repair, if necessary, in order to minimize the MTTR will be another challenge for electronic and software designers. This could create availability and maintainability problems (proposed potential solution in Chapter 3).

Some of the above-mentioned implementation issues will be addressed either in this chapter or in the next chapters, as mentioned above (in brackets).

## 3.9 IMPROVING THE PROPOSED STRATEGY

The above results show that the initial proposed hardware platform provides a high dependability, more precisely reliability, availability, and maintainability improvements, for analog and mixed-signal based SoCs. The penalty one has to pay is in terms of area, speed, power and control overhead. Therefore, in order to address these issues, an improved strategy starting from the design stage to the end-of-life of the system is required. The conceptual flow of such a strategy is shown in Figure 3.12.

Figure 3.12: The conceptual flow for improving the proposed strategy
It starts with the construction of a library of dependable IPs, which is then used to select the best combination of IPs to optimize the dependability, area, speed, and power of the system. This is followed by the construction of the dependable hardware architecture with little area and complexity overhead. The library of dependable IPs is a collection of different IPs having the same functionality but different values for dependability attributes, speed, area, and power overheads. The new dependable hardware system architecture is based upon the same concept of digitally-assisted analog and mixed-signal IPs as discussed above, but with single redundant hardware as compared to two in the previous case. Similarly, contrary to the previous architecture where an external test signal is used, an on-chip measurement architecture has been used to calibrate the performance of each analog and mixed-signal IP, as proposed in [Xin10]. Dependability issues related to complete IP failures, and system availability and maintainability issues, have been resolved in a similar way, that is, by providing built-in tuning/trimming and complete IP switching/replacement options.

## 3.9.1 CONSTRUCTING A DEPENDABLE IP LIBRARY

In order to create a dependable IP library, the dependability of each IP in the system has to be established at each attribute level, namely reliability, maintainability, and availability. The *reliability* of an IP can be estimated by means of reliability simulations at the design stage, where the reliability simulation flow usually consists of two phases [Bao09, Liu06]. In the first phase the given IP is simulated using aging models according to the technology information. The application determines the different stress profiles (e.g. temperature).
The combination of stress profiles with different failure models, built from experimental testing work and structural circuit information, can be used in a similar way to select reliability-critical substructures or devices [Bao09]. In the second phase, equivalent circuits are generated and incorporated into the original IP at system level. These are then simulated for degraded lifetime behaviour to estimate reliability as a function of time [Bao09].

As discussed before, in case of a repairable system, the *availability* of an IP can be related to its mean-time-between-failures (MTBF) and mean-time-to-repair (MTTR) and can be calculated by using equation (3.1) [Ker11]. Similarly, the *maintainability* of an IP can be related to MTTR and will be a function of the time required to detect a failure, diagnose the problem and take proper actions to correct this problem. The maintainability will further depend on the complexity of the circuit and the degree/accuracy to which this problem has to be resolved. A high complexity of the circuit (IP) and a high tuning/trimming accuracy will require more time to diagnose and subsequently repair the problem. Therefore, the MTTR of an IP can be calculated based upon the fault detection time, the communication time with the diagnosing circuit, and the time required to properly diagnose and subsequently repair the problem, followed by final verification.

Normally, parameters like MTBF and MTTR can be measured in two different ways. One way is to estimate these parameters at simulation level, where observing the performance parameters and their deviation from normal values (failure) as a result of the degradation process or other faults gives an estimate of the MTBF. This requires exact information on design specifications and the actual environmental conditions. Similarly, given a repairing mechanism, simulations can be run to estimate the time required to detect, diagnose and repair the fault, resulting in an MTTR. Another way of estimating these parameters (MTBF, MTTR), which requires more effort in terms of time and cost, is to conduct real accelerated test measurements under different stress conditions and calculate the real MTBF and MTTR values with much higher accuracy as compared to simulations.

Based on the above information, critical substructures or devices (i.e. transistors) can be identified and dependability improvement suggestions can also be made. These substructures or devices can then be redesigned or new structures can be inserted [Bao09] to achieve different values for dependability attributes. In this way one can generate multiple IPs with the same functionality having different dependability attributes along with their speed, area, and power overheads. With this information, each IP can be labelled with its reliability, availability, maintainability, area, speed, and power requirements, which can be used to build a library of dependable IPs.

#### 3.9.2 **OPTIMIZING SYSTEM DEPENDABILITY**

Once a library of IPs with different dependability levels (having different values for dependability attributes) is obtained, the next step is to select the IP(s) that give the best compromise between reliability, availability, maintainability, speed, area, power etc. as required by the application or customer. This can be achieved by Linear Programming (LP) techniques like the Simplex algorithm [Ker11]. For example, consider a mixed-signal front-end that has only two IPs, an operational amplifier and an analog-to-digital converter. If each of these IPs has four different flavours in the IP library in terms of different values for dependability attributes, speed, area and power consumption, then there will be sixteen different combinations in which these IPs can be connected (OpAmp followed by an ADC).
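The sixteen combinations can be made concrete with a small sketch. Note that it uses plain exhaustive enumeration rather than the Linear Programming approach of [Ker11], which is sufficient for sixteen cases but is a swapped-in technique. It assumes that the reliability of the series connection is the product of the two IP reliabilities and that area and power simply add. All flavour names and numerical values below are hypothetical and are not taken from the IP library or from [Ker11].

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class IPFlavour:
    name: str
    reliability: float  # probability at the time of interest
    area_um2: float
    power_mw: float


# Hypothetical library entries: four flavours per IP
OPAMPS = [
    IPFlavour("opamp_a", 0.90, 300.0, 0.4),
    IPFlavour("opamp_b", 0.93, 400.0, 0.5),
    IPFlavour("opamp_c", 0.96, 550.0, 0.7),
    IPFlavour("opamp_d", 0.99, 800.0, 1.0),
]
ADCS = [
    IPFlavour("adc_a", 0.88, 500.0, 0.8),
    IPFlavour("adc_b", 0.92, 650.0, 1.0),
    IPFlavour("adc_c", 0.95, 800.0, 1.3),
    IPFlavour("adc_d", 0.98, 1100.0, 1.8),
]


def all_combinations(opamps, adcs):
    """Evaluate every OpAmp-followed-by-ADC pair as a series connection."""
    combos = []
    for opamp, adc in product(opamps, adcs):
        combos.append({
            "opamp": opamp.name,
            "adc": adc.name,
            "reliability": opamp.reliability * adc.reliability,  # series rule
            "area_um2": opamp.area_um2 + adc.area_um2,           # overheads add
            "power_mw": opamp.power_mw + adc.power_mw,
        })
    return combos


def best_combination(opamps, adcs, min_reliability, max_area_um2, max_power_mw):
    """Return the most reliable combination that meets all constraints, or None."""
    feasible = [
        c for c in all_combinations(opamps, adcs)
        if c["reliability"] >= min_reliability
        and c["area_um2"] <= max_area_um2
        and c["power_mw"] <= max_power_mw
    ]
    return max(feasible, key=lambda c: c["reliability"]) if feasible else None


if __name__ == "__main__":
    print(len(all_combinations(OPAMPS, ADCS)))
    print(best_combination(OPAMPS, ADCS, 0.85, 1500.0, 2.5))
```

Changing the constraint values or the selection key (for example minimum area instead of maximum reliability) selects a different combination, which reflects the dependence on the requirements of the application or customer.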
The reliability, availability, and maintainability of these two IPs connected in series can be calculated using Reliability Block Diagrams (RBD), whereas the speed, area and power requirements can simply be added. With this information one can determine which combination is the best option from, for example, a reliability or an area point of view. In this way one can select the best option for combining IPs to meet the requirements set by the application or customer.

As described in the last section, the availability of a repairable system is highly dependent on the maintainability of the system. Therefore, in order to achieve higher availability, one has to reduce the repair time. In some cases, especially for analog and mixed-signal IPs, this can be quite hard, because they require a longer time than their digital counterparts to calibrate the performance and diagnose the problem in case of any failure in performance. Furthermore, this early selection of dependable IPs for achieving better dependability levels does not cover the scenario in which one of the selected IPs fails completely, resulting in complete system failure. These problems of achieving higher availability and avoiding complete system failures additionally require an improved hardware architecture. Such a hardware platform is shown in Figure 3.13.

Figure 3.13: New proposed hardware platform (NPHP) for achieving high system dependability

#### 3.9.3 NEW PROPOSED HARDWARE PLATFORM

The new proposed hardware platform (Figure 3.13) resembles a duplicate system in which two redundant hardware blocks are connected by means of two switches, SW1 and SW2, to provide two active paths. The upper active path is the path in which the upper two IPs are active and the lower active path is the path in which the lower two IPs are active.
The switches SW1 and SW2, which are used to exchange one active path for the other at predefined regular intervals of time (an example is provided in Chapter 4), are controlled by the central "Digital Processing" IP. The "Diagnose and Action" IP is responsible for calibrating the performance and diagnosing the problem of each IP using an on-chip measurement architecture as proposed in [Xin10]. These on-chip measurement units, the Central Measurement Unit (CMU) and the Local Measurement Unit (LMU), are shown in Figures 3.14 and 3.15 respectively [Xin10]. Depending on the measurement requirements, each IP may have one or multiple LMUs. The width of the digital test bus (DTB) depends on the number of LMUs, whereas a single-wire internal and a single-wire external analog test bus (IATB and EATB, Figure 3.15) are required. The Wrapper Boundary Register (WBR) and Wrapper Instruction Register (WIR) can be controlled by the "Diagnose and Action" IP. The WBRs are used to capture measurement data.

Figure 3.14: Block Diagram of the Central Measurement Unit (CMU) [Xin10]

Figure 3.15: Block Diagram of the Local Measurement Unit (LMU) [Xin10]

As described in [Xin10], the CMU consists of a DAC, a ramp generator, a high-frequency oscillator, and a number of ripple counters. The high-frequency oscillator is used to run the ripple counters, which can be stopped by the signals on the DTB. The main part of the LMU is a comparator whose inputs are connected to the IATB and to IP internal test points via analog switches (this is an invasive technique; a non-invasive technique will be discussed in Chapter 4). The output of the LMU comparator controls the counter in the CMU via the DTB. The basic responsibility of the CMU is analog-to-digital conversion using the built-in ramp generator and counters, assisted by the LMU.
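The ramp-and-counter conversion described above can be illustrated with a simplified behavioural sketch. It models an ideal linear ramp starting at 0 V, an ideal comparator and a counter clocked by the oscillator; it does not reproduce the actual circuit of [Xin10]. The parameter names and the numerical values (ramp slope, oscillator frequency, counter width) are assumptions chosen only for illustration.

```python
def ramp_convert(v_test, ramp_slope_v_per_s, f_osc_hz, max_count):
    """Return the counter value at which the ramp first reaches v_test.

    The counter advances once per oscillator period. When the ramp voltage
    reaches the test-point voltage, the comparator output stops the counter.
    Returns None if the ramp never reaches v_test within max_count periods.
    """
    for count in range(max_count + 1):
        elapsed_s = count / f_osc_hz
        v_ramp = ramp_slope_v_per_s * elapsed_s
        if v_ramp >= v_test:  # comparator trips and the counter is stopped
            return count
    return None


def count_to_voltage(count, ramp_slope_v_per_s, f_osc_hz):
    """Convert a stopped counter value back into the ramp voltage it represents."""
    return count * ramp_slope_v_per_s / f_osc_hz


if __name__ == "__main__":
    # Hypothetical settings: 1000 V/s ramp and 1 MHz oscillator, i.e. 1 mV per count
    slope, f_osc = 1000.0, 1.0e6
    count = ramp_convert(0.4567, slope, f_osc, max_count=4095)
    print(count, count_to_voltage(count, slope, f_osc))
```

In this idealized model the resolution is the ramp slope divided by the oscillator frequency, so a slower ramp or a faster oscillator gives a finer voltage step at the cost of a longer conversion or a wider counter.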
The purpose of using a comparator in an LMU is twofold: it compares the analog signal at the test point with the ramp voltage and it isolates the test point from parasitics in the test buses [Xin10]. Using this measurement architecture one can measure the voltages at different test points within the analog circuit. In order to measure DC currents, an additional I/V converter needs to be placed between the test point and the input of the comparator in the LMU. Further information can be found in [Xin10]. The purpose of these measurements is to estimate the performance of the analog and mixed-signal circuit/IP, which can then be analysed by the "Diagnose and Action" IP to take further actions as discussed in section 3.10.2.

#### 3.9.3.1 WORKING PRINCIPLE

The working principle of our new proposed hardware platform (NPHP) (Figure 3.13) is different from that of the initial proposed hardware platform (Figure 3.2). In this NPHP, at predefined regular intervals of time, the current active path (e.g. the upper active path) is exchanged for the current non-active path (i.e. the lower active path) by using switches SW1 and SW2. If initially all of the analog and mixed-signal IPs are fully operational, then replacing the upper active path with the lower active path will result in a fully dependable system limited by the replacement (switching) time. In the initial proposed hardware platform, however, the current active path was exchanged for an alternate one only when there were no more tuning options left in the current active path. This resulted in a higher down time and hence degraded the availability of the front-end. Suppose that at time ' $t_0$ ' the upper active path (Figure 3.13) is the current active path, as shown in Figure 3.16.

Figure 3.16: Timing diagram showing when the upper and lower paths (Figure 3.13) are active.

Figure 3.17: A possible fault tolerant architecture for switches SW1&2 (Figure 3.13)
Then at time ' $t_1$ ' the current active path will be replaced with the lower active path and each individual IP in the upper active path will be diagnosed and calibrated by the "Diagnose and Action" IP using the on-chip measurement units (LMU and CMU). In case there is a mismatch between the measured values and the expected values, proper tuning actions will be carried out by the "Diagnose and Action" IP using the digital tuning knobs, as shown in Figure 3.13. Once the calibration is completed, the upper active path will be activated again for normal operation by using switches SW1 and SW2. At this point in time, ' $t_2$ ', the IPs in the lower active path will be calibrated and diagnosed by the "Diagnose and Action" IP and, if required, proper tuning actions will be taken by using the digital tuning knobs. The calibration and repair/tuning process of each path will take place in such a way that its performance is kept within a predefined tolerance band, in order to reduce the sensitivity at the output while switching from one active path to the other by means of switches SW1 and SW2.

The position of the switches SW1 and SW2 in the new proposed hardware platform (NPHP) is very critical, and any failure in these switches will result in a complete failure of the whole system. Therefore, to avoid a single point of failure, these switches (SW1 and SW2) are made highly fault tolerant. A possible architecture could be a switch having parallel and serial duplication, as shown in Figure 3.17, where the switches are connected in parallel and in series for obtaining better dependability levels. The NPHP and its working principle will enhance the availability of the whole system because at any time one of the two paths (upper or lower) will be available for service.
This strategy will also provide a better way to diagnose the faults, if any, in the analog and mixed-signal IPs. Diagnosing the IPs of the non-active path for faults provides a longer time for thoroughly evaluating the performance of each IP and, if required, taking proper actions by using the on-chip digital tuning capabilities. These improvements will play an important role in enhancing the lifetime dependability of these mixed-signal front-ends, which is discussed in the next section.

#### 3.9.4 ANALYSING DEPENDABILITY OF THE IMPROVED STRATEGY

The dependability of the improved proposed strategy can be analysed at two levels: 1) at the design level and 2) at the proposed hardware platform level. Furthermore, a comparison can be made between the conventional duplicate or triplicate system, the initial proposed hardware platform (Figure 3.2) and the new proposed hardware platform (Figure 3.13).

#### 3.9.4.1 DEPENDABILITY IMPROVEMENT AT THE DESIGN LEVEL

The dependability of analog and mixed-signal front-ends can be enhanced at the design stage by properly selecting the best combination of IPs. This can be analysed by using, for example, Linear Programming (LP) for optimization, as shown in Figure 3.18 [Ker11]. Here two IPs in an analog/mixed-signal front-end, an OpAmp and an ADC, have been optimized, each having four different flavours with different dependability attributes along with their area, power and speed values.

Figure 3.18: Some calculations on the optimization of dependability attributes along with area, power and speed parameters [Ker11].

In Figure 3.18 the Rel, Av, and mean-time-to-repair (MTTR) parameters are used to represent the reliability (Rel), availability (Av), and maintainability (MTTR) attributes of the overall dependability of each of the sixteen possible combinations (OpAmp followed by an ADC) for an analog/mixed-signal front-end. A possible combination of OpAmp and ADC can be selected by using the calculations presented in Figure 3.18.
This depends on the requirements set by the application or user. For example, case 8 can satisfy the dependability requirements with Rel > 85%, Av > 95%, MTTR < 500 ms, area overhead < 1500 $\mu m^2$, speed < 40 ns and a power overhead of less than 2.5 mW. Further dependability enhancements can be achieved by using the new proposed hardware platform (NPHP).

#### 3.9.4.2 DEPENDABILITY IMPROVEMENT AT THE HARDWARE SYSTEM LEVEL

In order to calculate the dependability enhancements resulting from the NPHP, first the reliability enhancement has been calculated using Markov analysis and Reliability Block Diagrams (RBDs), as discussed in section 3.3. Using Markov analysis, each repairable analog/mixed-signal IP can be divided into a number of states, and a state model can be constructed as discussed before in section 3.7. The solution of this state model, using the state-space approach, gives the probabilities that the system will be in the different states at a particular time. Subsequently, the sum of the probabilities over the non-failure states provides the reliability of the system at that time. Given these reliabilities for each IP, the overall *reliability* of the whole system can be calculated using Reliability Block Diagrams, as discussed in section 3.7. Figure 3.19 shows the results of MATLAB reliability simulations for the NPHP. These simulations also give the comparison between the conventional and the proposed strategies. They show that the reliability of a system using the NPHP with repairable IPs (WRIPs) is higher than that of a conventional duplicate and triplicate system (DS & TS) with non-repairable IPs (NRIPs).

Figure 3.19: Reliability of the new proposed hardware platform (NPHP) and initial proposed hardware platform (IPHP) with repairable IPs (WRIP) and conventional triplicate and duplicate systems (DS & TS) with non-repairable IPs (NRIP) ( $\lambda$ =1/1000).
This increase in reliability is smaller than that achieved by the IPHP; however, it can be raised to 99.99% by increasing the repair rate (e.g. by a factor of 20), as shown in Figure 3.19. This indicates that similar reliability improvements can be achieved with less hardware redundancy than in the IPHP (Figure 3.2). The bend in the reliability curves shows that initially the reliability of the conventional strategies (duplicate and triplicate systems) and of the proposed strategies (Figures 3.2 and 3.13) is similar. However, once a failure occurs, the conventional strategies cannot cope with it, whereas the proposed strategies are capable of maintaining the reliability of the system.

A potential failure of the current active path while the non-active path is being diagnosed and repaired can be avoided by reducing the repair time of the non-active path. This risk can be minimized by carefully selecting the time interval required to properly diagnose and repair faults, if any, relative to the system degradation profile (degradation rate). For example, if the performance of the current active path can move beyond its specifications within five minutes, then the repair time of the non-active path should be less than five minutes, and this non-active path should be activated before the current active path fails to deliver its service.

Figure 3.20: Availability of the new proposed hardware platform (NPHP) and initial proposed hardware platform (IPHP) with repairable IPs (WRIP) and conventional duplicate and triplicate systems (DS & TS) with non-repairable IPs (NRIP) ($\lambda$ = 1/1000).

The *availability* of the system, being the readiness for correct service, is directly related to the switching time from one active path to the other. With a (potentially) zero switching time, the availability of the NPHP will be 100%. Figure 3.20 shows the availability simulations of conventional and proposed strategies.
The availability of the IPHP can reach that of the NPHP if the repair rate is increased. This means that the availability of the NPHP is much higher than that of the conventional systems (DS & TS) and the IPHP.

The *maintainability*, being the probability that the system is successfully repaired after it fails, can again be calculated in a similar way as discussed in section 3.7, that is, by calculating how much the repair mechanism (i.e. the introduction of repairable components and the central "Diagnose and Action" IP) decreases the unreliability (1 - Reliability), or increases the reliability, relative to a system having no repairable components (i.e. no repair mechanism). This is directly related to the tuning capabilities of each IP. The larger the number of possibilities to digitally tune the performance of an IP, the higher the probability that the "Diagnose and Action" IP can tune an analog/mixed-signal IP back to its specifications in case of a performance failure. A comparison of maintainability between the IPHP and the NPHP is shown in Figure 3.21. This figure shows that the maintainability of the NPHP is lower than that of the IPHP; however, similar maintainability can also be achieved in the NPHP by increasing the repair rate. The maintainability represents a successful repair after a failure occurs. Therefore, initially the maintainability is zero because there are no failures in the system. However, as time passes and the probability of failure increases, the maintainability also increases, as shown in Figure 3.21. Furthermore, the maintainability of the conventional duplicate and triplicate systems (DS & TS) with non-repairable IPs is also shown in Figure 3.21 for comparison purposes.

Figure 3.21: Maintainability of the new proposed hardware platform (NPHP) and initial proposed hardware platform (IPHP) with repairable IPs (WRIP) and conventional duplicate and triplicate systems (DS & TS) with non-repairable IPs (NRIP) ($\lambda$ = 1/1000).
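The Markov-based calculations described in this section can be sketched as follows. This is a minimal illustration and not the MATLAB model behind Figures 3.19 to 3.21: the three-state reliability model (within specification, degraded but repairable, failed), the two-state availability model and the function names are assumptions made here. Only the failure rate $\lambda$ = 1/1000 and the "20 times" repair rate are taken from the text.

```python
import math


def step_rk4(p, Q, dt):
    """One Runge-Kutta step of dp/dt = p Q (p: state probabilities, Q: generator matrix)."""
    n = len(p)

    def deriv(v):
        return [sum(v[i] * Q[i][j] for i in range(n)) for j in range(n)]

    k1 = deriv(p)
    k2 = deriv([p[i] + 0.5 * dt * k1[i] for i in range(n)])
    k3 = deriv([p[i] + 0.5 * dt * k2[i] for i in range(n)])
    k4 = deriv([p[i] + dt * k3[i] for i in range(n)])
    return [p[i] + dt / 6.0 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]) for i in range(n)]


def state_probabilities(Q, p0, t_end, dt=1.0):
    """Probability of being in each state at time t_end, starting from p0."""
    p = list(p0)
    for _ in range(int(round(t_end / dt))):
        p = step_rk4(p, Q, dt)
    return p


def reliability(lam, mu, t):
    """Assumed model: 0 = in spec, 1 = degraded (repairable), 2 = failed (absorbing)."""
    Q = [[-lam, lam, 0.0],
         [mu, -(lam + mu), lam],
         [0.0, 0.0, 0.0]]
    p = state_probabilities(Q, [1.0, 0.0, 0.0], t)
    return p[0] + p[1]          # sum over the non-failure states


def availability(lam, mu, t):
    """Assumed two-state model: 0 = up, 1 = down and under repair."""
    Q = [[-lam, lam],
         [mu, -mu]]
    return state_probabilities(Q, [1.0, 0.0], t)[0]


def availability_closed_form(lam, mu, t):
    """Textbook solution of the two-state model, used as a cross-check."""
    return mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)


def repair_contribution(lam, mu, t):
    """One possible reading of maintainability: reliability gained through repair."""
    return reliability(lam, mu, t) - reliability(lam, 0.0, t)


lam = 1.0 / 1000.0              # failure rate used in the figures
mu = 20.0 * lam                 # repair rate, "20 times" as in the text
for t in (0.0, 500.0, 1000.0, 2000.0):
    print(t, round(reliability(lam, 0.0, t), 4), round(reliability(lam, mu, t), 4),
          round(availability(lam, mu, t), 4), round(repair_contribution(lam, mu, t), 4))
```

With a repair rate of zero the model reduces to a non-repairable IP, so the same functions can be used to compare repairable and non-repairable configurations. Combining several IPs into a system reliability would still require the Reliability Block Diagram step described above.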
To fully understand the behaviour, performance and contributions of the NPHP in enhancing the dependability of the whole system, a simulation setup consisting of a single active path has been considered, which is discussed in the next sections.

# 3.10 SIMULATING SYSTEM DEPENDABILITY IMPROVEMENTS

In the previous sections we have simulated the dependability, mainly the reliability, maintainability and availability, of the proposed strategies using mathematical models based on failure-rate and repair-rate probabilities; the corresponding results were expressed as probabilities. In this section, the new proposed strategy is analysed using behavioural models in a VHDL-AMS environment, so that the results can be evaluated in terms of waveforms. System-level behavioural models, which are frequently employed in industry to study the behaviour of analog and mixed-signal circuits, are used here to investigate the dependability improvements of the NPHP. In order to use these behavioural models, first the ideal behavioural models of the whole system and its sub-systems have to be constructed and simulated. Second, models of the degraded performance parameters have to be incorporated into the ideal behavioural models to investigate their effect on the ideal performance of the system. Third, behavioural models of the repair mechanism and the associated repair options have to be incorporated to simulate their contribution to improving the dependability of the whole system.

Figure 3.22: Block diagram of the behavioural model of a temperature sensor

#### 3.10.1 BEHAVIOURAL MODELLING

Hardware Description Languages (HDLs) have been used since the sixties to model and simulate systems. The behavioural mechanism of an HDL allows the user to express the operation of a system at various levels of abstraction, ranging from very detailed to highly abstract. These HDLs can be divided into digital, analog and mixed-signal HDLs.
Digital HDLs are based upon event-driven techniques and discrete models, whereas analog HDLs support the solution of differential and algebraic equations. VHDL-AMS, being a mixed-signal language, supports both event-driven and differential/algebraic-equation-based techniques and is hence suitable for analog, digital and mixed-signal systems [Chr02].

## 3.10.1.1 MODELLING ANALOG AND MIXED-SIGNAL FRONT-END

The differential/algebraic equations describing the behaviour of any analog or mixed-signal front-end can be easily implemented in VHDL-AMS if the overall behaviour of the front-end is divided into simpler behavioural sub-blocks. In other words, if the complex functionality of an analog/mixed-signal front-end is divided into a number of successive, simpler functional sub-blocks, it can be implemented in a more structured way using VHDL-AMS. After selecting these simpler behavioural sub-blocks, the next step is to extract their mathematical or algebraic equations and to determine how they interact with each other, based on fundamental circuit theory and experimental observations. For example, the next sections discuss how the behavioural models of a temperature sensor, an operational amplifier and an analog-to-digital converter are constructed.

# A. MODELLING A TEMPERATURE SENSOR

Figure 3.22 shows a block diagram of the behavioural model of a temperature sensor. This model can easily be extracted from the most popular equation for a parasitic bipolar PNP transistor used in standard CMOS technology [Bak02], being:

$$V_{ptat} = \frac{kT}{q} \ln \left[ \frac{I_1}{I_2} \right]$$ (3.13)

where $I_1$ and $I_2$ are the collector currents at two different base-emitter voltages ($V_{BE}$), $k$ is the Boltzmann constant, $q$ is the electron charge and $T$ is the absolute temperature. The above equation shows that if the ratio of the two currents is kept constant, then every term on the right-hand side of the equation is constant except the temperature $T$.
Therefore, one can replace this constant term with a constant gain, and hence the whole temperature sensor can be modelled as a transducer that converts the temperature into a proportional voltage, i.e. the temperature multiplied by a constant gain factor, as shown in Figure 3.22.

# B. MODELLING AN OPERATIONAL AMPLIFIER

Figure 3.23: Block diagram of the behavioural model of an operational amplifier

Figure 3.24: Block diagram of the behavioural model of an operational amplifier (OpAmp) with degradations

The behaviour of an operational amplifier can be represented by four sub-blocks as shown in Figure 3.23. The "Input" sub-block represents input parameters like input resistance, input capacitance, input current and input offset voltage. The "GAIN" sub-block represents the gain of the operational amplifier. Depending upon the user, this can be the open-loop or the closed-loop gain. One can describe this gain as a function of different gain parameters, for example differential gain, common-mode gain and power-supply gain. Common-mode gain and power-supply gain can be extracted from operational amplifier data-sheet parameters like the common-mode rejection ratio (CMRR) and the power-supply rejection ratio (PSRR). Once these parameters are obtained, one can formulate a mathematical equation that describes the gain of the complete operational amplifier. The "Frequency Response" sub-block represents how the operational amplifier reacts to different input frequencies. This can be obtained from the transfer function composed of the poles and zeros of the operational amplifier [Lun04]. VHDL-AMS provides a way to simulate this transfer function based upon vectors composed of its poles and zeros [Ash03]. The "Output" sub-block represents parameters like output resistance, output current, output DC offset, output delay, output saturation voltages, and output slew rate.
After specifying these parameters one can formulate mathematical equations describing the behaviour of the operational amplifier output. One can also insert degradations of different performance parameters as shown in Figure 3.24. This can be implemented in VHDL-AMS along with the other sub-blocks of the analog/mixed-signal front-end to analyse the dependability of the whole system under these degradations.

# C. MODELLING THE ANALOG-TO-DIGITAL CONVERTER

The functional behaviour of the ADC can also be divided into a number of sub-blocks as shown in Figure 3.25 (details of a successive approximation register ADC are discussed in Chapter 6). The "Sample & Hold" sub-block is responsible for taking samples at regular intervals of time defined by the sampling frequency. The "Input Limiter" sub-block puts a limit on the input voltage range and can be used to raise flags like overflow or underflow. The "GAIN" sub-block can be a constant multiplier or a unity multiplier. The "Output Offset" sub-block represents the ADC output offset voltage. The "Quantizer" sub-block is responsible for mapping continuous input voltage values to the corresponding discrete values. Finally, depending on the user, the "Output Encoder" sub-block is used to encode the output, for example in binary, Gray or thermometer code. Similarly, degradation of different performance parameters as a result of global degradation effects can also be inserted in these behavioural models to see their effects on the whole system performance, as shown in Figure 3.26.

Figure 3.26: Block diagram of the behavioural model of an analog-to-digital converter (ADC) with degradations
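The structure of these three behavioural models can be illustrated with a small static sketch. It is written in Python as a stand-in for VHDL-AMS and is not the model used in this work: the frequency-response and sample-and-hold sub-blocks are omitted, and all parameter values (current ratio, gain, supply range, resolution) as well as the function names are assumptions chosen for illustration only.

```python
import math

K_BOLTZMANN = 1.380649e-23      # Boltzmann constant [J/K]
Q_ELECTRON = 1.602176634e-19    # electron charge [C]


def temperature_sensor(temp_kelvin, current_ratio=8.0):
    """PTAT voltage following Eq. (3.13): (k*T/q) * ln(I1/I2), i.e. T times a constant gain."""
    return K_BOLTZMANN * temp_kelvin / Q_ELECTRON * math.log(current_ratio)


def opamp(v_in, gain=20.0, input_offset=0.0, v_sat_low=0.0, v_sat_high=3.3):
    """Input offset -> gain -> output saturation (frequency response omitted)."""
    v_out = gain * (v_in + input_offset)
    return min(max(v_out, v_sat_low), v_sat_high)


def adc(v_in, n_bits=10, v_ref=3.3, offset=0.0, gray=False):
    """Offset -> input limiter -> quantizer -> output encoder (binary or Gray code)."""
    v = min(max(v_in + offset, 0.0), v_ref)             # input limiter
    code = min(int(v / v_ref * 2 ** n_bits), 2 ** n_bits - 1)   # quantizer
    return code ^ (code >> 1) if gray else code         # output encoder


def front_end(temp_kelvin, opamp_offset=0.0, adc_offset=0.0):
    """Temperature sensor followed by OpAmp and ADC, with optional offset degradations."""
    return adc(opamp(temperature_sensor(temp_kelvin), input_offset=opamp_offset),
               offset=adc_offset)


ideal = front_end(300.0)
degraded = front_end(300.0, opamp_offset=0.005)   # 5 mV of OpAmp offset degradation
print("ideal code:", ideal, "degraded code:", degraded)
```

Passing a non-zero offset to `front_end` mimics the insertion of a parameter degradation into the otherwise ideal model, which is the second of the three modelling steps listed in section 3.10.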
# 3.10.2 SIMULATION SETUP

Figure 3.27 shows the block diagram of the simulation setup, where a conventional front-end composed of a temperature sensor, an operational amplifier and an analog-to-digital converter has been modelled as described above. These behavioural models are further simulated in VHDL-AMS [Bak02, Ash03, Chr02, Lun04]. For illustration purposes, the upper path composed of OpAmp IP1 and ADC IP1 (Figure 3.27) has been considered as the current active path. To investigate the system-level dependability of such a system (front-end), one first has to consider which performance parameters are crucial from a dependability point of view; subsequently, the influence of those performance parameters has to be included in the behavioural models. When the concern is the degradation of performance parameters and its influence on the whole system, these crucial performance parameters can be obtained from design-stage aging simulations of each individual IP [Wan11]. To simplify matters, as an example, the offsets of both IPs (OpAmp and ADC) have been considered as the important dependability issue due to aging degradations. Furthermore, it has been assumed that each OpAmp and ADC has two digital tuning knobs, as discussed before in section 3.7. Moreover, only the operational amplifier and the analog-to-digital converter are considered to be influenced by aging degradations. For simplicity, the temperature sensor has been considered independent of aging degradations. This can be assumed, for example, if the temperature sensor is a silicon temperature sensor [Kty00]. Hence, in this particular case, the offset-parameter degradations of both these IPs will degrade the performance of the whole analog/mixed-signal front-end.

#### 3.10.2.1 SIMULATION RESULTS

Figure 3.29 shows a screen-shot of the simulation results of the simulation setup shown in Figure 3.27.
At predefined regular intervals of time the upper active path will be replaced with the lower active path by the "Digital Processing" IP using switches SW1 and SW2. The "Digital Processing" IP will also activate the "Diagnose & Action" IP to monitor the performance of OpAmp IP1 and ADC IP1 using the on-chip measurement architecture composed of the CMU and LMUs [Xin10]. At each of these points in time, the offset parameter of both IPs is calculated and compared against its specification. Once an unacceptable degradation in the offset parameter of OpAmp IP1 or ADC IP1 is diagnosed, the "Diagnose & Action" IP takes proper repair actions by digitally tuning the OpAmp and/or ADC offset parameter back to its allowed value, as encircled in Figure 3.29. Furthermore, once the digital tuning is done, the upper active path will be activated again to perform its normal operation.

Figure 3.27: Simulation setup of the target system

Figure 3.28: Proposed timing diagram showing when the upper and lower paths (Figure 3.27) are active for a balanced workload on each path.

Replacing the current active path with another fully functional active path (i.e. the upper active path with the lower active path in the current case) will also increase the availability of the system, as discussed before in section 3.9.4. A complete failure of the current active path can occur when all of its digital tuning options are used up while tuning options of the other active path still remain. To minimize this risk, both active paths can be used in turn for normal operation, each for a predefined duration of time, as shown in Figure 3.28. This duration of time is different from the predefined interval of time for regular performance diagnosis. This is to make sure that both active paths are used to their maximum possible potential, making the whole system more dependable.
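The periodic diagnose-and-repair loop described above can be sketched for a single IP as follows. This is a hedged illustration rather than the VHDL-AMS implementation: the function name, the 30-minute diagnosis interval and the assumption that one tuning action restores the offset completely to zero are choices made here, and path switching is not modelled. The degradation rates, the 5 mV specification limit and the two tuning knobs are those of the example discussed next.

```python
def simulate_offset_repairs(rate_mv_per_hour, spec_mv, tuning_knobs,
                            t_end_hours, check_interval_minutes=30):
    """Return (repair times in hours, failure time in hours or None) for one IP.

    The offset grows linearly with time. At every diagnosis instant it is compared
    with the specification limit; if the limit is reached and a tuning knob is left,
    the offset is tuned back (assumed: fully back to zero), otherwise the IP fails.
    """
    repairs = []
    failure_time = None
    knobs_left = tuning_knobs
    last_reset_min = 0
    t_min = 0                                   # integer minutes avoid time drift
    end_min = int(round(t_end_hours * 60))
    while t_min <= end_min:
        offset_mv = rate_mv_per_hour * (t_min - last_reset_min) / 60.0
        if offset_mv >= spec_mv - 1e-9:         # small tolerance for rounding errors
            if knobs_left > 0:
                knobs_left -= 1                 # use up one digital tuning option
                last_reset_min = t_min
                repairs.append(t_min / 60.0)
            else:
                failure_time = t_min / 60.0     # nothing left to tune: IP fails
                break
        t_min += check_interval_minutes
    return repairs, failure_time


# OpAmp offset degrades by 5 mV per 3 hours, ADC offset by 5 mV per 4.5 hours.
opamp_repairs, opamp_failure = simulate_offset_repairs(5.0 / 3.0, 5.0, 2, 24.0)
adc_repairs, adc_failure = simulate_offset_repairs(5.0 / 4.5, 5.0, 2, 24.0)
print("OpAmp repairs at", opamp_repairs, "h, failure at", opamp_failure, "h")
print("ADC repairs at", adc_repairs, "h, failure at", adc_failure, "h")
```

Setting `tuning_knobs` to zero models a non-repairable IP, which fails the first time its offset reaches the specification limit.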
As an example, 5 mV has been assumed as the specification boundary for both the OpAmp and the ADC offsets. This means that a failure occurs if the offset voltage of the OpAmp or the ADC moves beyond the 5 mV threshold. Similarly, constant degradation rates of 5 mV per 3 hours and 5 mV per 4.5 hours have been assumed for the OpAmp and ADC offsets respectively. Figure 3.29 shows that the first and second repair of the OpAmp offset occur at 3 and 6 hours respectively, whereas the first and second repair of the ADC offset occur at 4.5 and 9 hours respectively. This means that in case of a duplicate system, if similar failures occur in a non-repairable OpAmp and ADC at the same time, the non-repairable OpAmp and ADC will not be available for their service after 6 and 9 hours respectively. However, as a result of properly diagnosing and taking repair actions in the NPHP, the OpAmp and ADC keep functioning correctly and remain available for their correct service for a longer time. This means the NPHP provides its services for a longer time than a conventional duplicate system. In short, by providing the digital tuning and switching mechanism, the dependability of the whole system has been enhanced, or can be maintained for a longer time, as compared to the conventional duplicate system with non-repairable IPs.

Figure 3.29: Advance VHDL-AMS simulations of the new proposed hardware platform (Figure 3.27).

These simulations provide results in terms of waveforms that are based on the behavioural models of the system. In Chapter 4, however, we will discuss simulation results based on the mean-time-between-failures (MTBF) of the system, in order to obtain numerical results instead of waveforms based on behavioural simulations.

# 3.11 CONCLUSIONS

In this chapter we have proposed a strategy for dependability enhancement and its analysis for analog and mixed-signal SoCs.
Degradation mechanisms that threaten the dependability of integrated circuits, such as NBTI, HCI and TDDB, have become quite apparent in recent submicron technologies; this calls for strategies that span from the circuit design phase to the end-of-life of a product. Markov processes and reliability block diagrams have been utilized to theoretically analyse the reliability, maintainability and availability, being the important attributes of dependability, of the proposed strategy. The presented strategy combines efforts from the design stage of the IPs up to the system level. Initial simulations at the design stage can be utilized to establish a library of dependable IPs by utilizing digitally-assisted and digital repairing techniques for mixed-signal SoCs. Simple optimization techniques based on Linear Programming (LP) can further lead to the selection of the best possible combination of IPs from these libraries of dependable IPs for better values of the dependability attributes. These decisions can be further combined with a new proposed hardware platform having on-chip measurement and digital tuning capabilities, together with some switching techniques, for enhancing the dependability of analog and mixed-signal SoCs. The two proposed strategies have been simulated and compared against conventional duplication and triplication strategies at the theoretical (mathematical) and behavioural simulation levels for a target system consisting of a temperature sensor, an operational amplifier and an ADC. Both the theoretical and the behavioural simulation results show that utilizing the digital programming/tuning capabilities of different analog and mixed-signal IPs, together with a switching mechanism, offers a promising route to dependability enhancement of future critical systems, especially for analog and mixed-signal front-ends. Further improvements in the proposed strategy will be discussed in the coming chapters.

#### 3.12 REFERENCES

[Ash03] P.J. Ashenden, G.D. Peterson, and D.A.
Teegarden, "The system designer's guide to VHDL-AMS: analog, mixed-signal, and mixed-technology modeling," Morgan Kaufmann Publishing Press, ISBN 1-558-60749-8, 2003.

[Avi01] A. Avizienis, J-C. Laprie, and B. Randell, "Fundamental concepts of dependability," in Laboratory for Analysis and Architecture of Systems (LAAS-CNRS) Technical Report no. 01-145, Apr. 2001.

[Avi04] A. Avizienis, J-C. Laprie, B. Randell and C. Landwehr, "Basic concepts and taxonomy of dependable and secure computing," in IEEE Trans. Dependable and Secure Computing, Vol. 1, No. 1, pp. 11-33, 2004.

[Bak02] A. Bakker, "CMOS smart temperature sensors- an overview," Proc. IEEE Sensors, Vol. 2, pp. 1423-1427, 2002.

[Bao09] Y. Baoguang, F. Qingguo, J.B. Bernstein, Q. Jin, and D. Jun, "Reliability Simulation and Circuit-Failure Analysis in Analog and Mixed-Signal Applications," in IEEE Trans. Device and Materials Reliability, Vol. 9, No. 3, pp. 339-347, 2009.

[Buj04] G. Buja, S. Castellan, R. Menis and A. Zuccollo, "Dependability of safety-critical systems," in Proc. IEEE Int. Conf. Industrial Technology, Vol. 3, pp. 1561-1566, 2004.

[Buj06] G. Buja and R. Menis, "Conceptual frameworks for dependability and safety of a system," in Proc. IEEE Int. Symp. Power Electronics, Electrical Drives, Automation and Motion, pp. 44-49, May 2006.

[Buk06] J.V. Bukowski, "Using Markov Models to Compute Probability of Failed Dangerous When Repair Times Are Not Exponentially Distributed," in Proc. IEEE Int. Reliability and Maintainability Symp., pp. 273-277, 2006.

[Cha07] P. Chaparala, D. Brisbin, K. Jonggook and B. OConnell, "Reliability challenges in analog and mixed signal technologies," in IEEE Int. Symp. Physical and Failure Analysis of Integrated Circuits, pp. 135-140, 2007.

[Cha10] W.L. Chang, J.Y. Luo, Y. Qi, and B. Wang, "Reliability and Failure Analysis in Designing a Typical Operation Amplifier," in IEEE Int. Symp. Physical and Failure Analysis of Integrated Circuits (IPFA), pp. 1-4, 2010.

[Chr02] E.
Christen, and K. Bakalar, "VHDL-AMS - a hardware description language for analog and mixed-signal applications," in IEEE Trans. J. Circuits and Systems II: Analog and Digital Signal Processing, Vol. 46, No. 10, pp. 1263-1272, 2002.

[Jha05] N.K. Jha, P.S. Reddy, D.K. Sharma, and V.R. Rao, "NBTI Degradation and Its Impact for Analog Circuit Reliability," in IEEE Trans. Electron Devices, Vol. 52, No. 12, pp. 2609-2615, 2005.

[Ker10] H.G. Kerkhoff, and J. Wan, "Dependable digitally-assisted mixed-signal IPs based on integrated self-test & self-calibration," in Proc. IEEE Int. Mixed-Signals, Sensors and Systems Test Workshop, pp. 1-6, 2010.

[Ker11] H.G. Kerkhoff, "New View Requirements for Analogue/MS IPs for Dependability Optimization in Heterogeneous SoC Design," in Proc. IEEE Int. Mixed-Signals, Sensors and Systems Test Workshop, Santa Barbara, California, May 2011.

[Kha11a] M.A. Khan, and H.G. Kerkhoff, "A System-Level Platform for Dependability Enhancement and its Analysis for Mixed-Signal SoCs," in IEEE Int. Symp. Design and Diagnostics of Electronic Circuits & Systems (DDECS), pp. 17-22, 2011.

[Kha11b] M.A. Khan, and H.G. Kerkhoff, "SoC Mixed-Signal Dependability Enhancement: A Strategy from Design to End-of-Life," in IEEE Int. Symp. Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), pp. 374-381, 2011.

[Kty00] KTY83-1 series, "Silicon Temperature Sensor," Data Sheet, Philips Semiconductor, 2000.

[Kum09] R. Kumar and A. Jackson, "Accurate Reliability Modeling using Markov Analysis with Non-Constant Hazard Rates," in Proc. IEEE Int. Aerospace Conf., pp. 1-7, 2009.

[Liu06] Z. Liu, B.W. McGaughy, and J.Z. Ma, "Design tools for reliability analysis," in IEEE Design Automation Conference (DAC), pp. 182-187, 2006.

[Lun04] K.H. Lundberg, "Internal and external op-amp compensation: a control-centric tutorial," in Proc. IEEE Conf. American Control Conference, Vol. 6, pp. 5197-5211, 2004.

[Mat14] J. Mathew, R.A. Shafik, and D.K.
Pradhan, "Energy-Efficient Fault-Tolerant Systems," Springer Publishing Press, ISBN 978-1-4614-4193-9, 2014.

[Mur06] B. Murmann, "Digitally Assisted Analog Circuits," in IEEE Dallas Workshop on Design, Applications, Integration and Software, pp. 23-30, 2006.

[Mur07] B. Murmann, and B.E. Boser, "Digital Domain Measurement and Cancellation of Residue Amplifier Nonlinearity in Pipelined ADCs," in IEEE Trans. Instrumentation and Measurement, Vol. 56, No. 6, pp. 2504-2514, 2007.

[Ros05] J.M. de la Rosa, et al., "A CMOS 110-dB@40-kS/s programmable-gain chopper-stabilized third-order 2-1 cascade sigma-delta modulator for low-power high-linearity automotive sensor ASICs," in IEEE Journal of Solid-State Circuits, Vol. 40, No. 11, pp. 2246-2264, 2005.

[Sir04] E. Siragusa, I. Galton, "A Digitally Enhanced 1.8-V 15-bit 40-MSample/s CMOS Pipelined ADC," in IEEE Journal of Solid-State Circuits, Vol. 39, No. 12, pp. 2126-2138, 2004.

[Vas03] I. Vassiliou, et al., "A single-chip digitally calibrated 5.15-5.825 GHz 0.18-μm CMOS transceiver for 802.11a wireless LAN," in IEEE Journal of Solid-State Circuits, Vol. 38, No. 12, pp. 2221-2231, 2003.

[Wan11] J. Wan, and H.G. Kerkhoff, "Boosted gain programmable OpAmp with embedded gain monitor for dependable SoCs," in IEEE Int. SoC Design Conference (ISOCC) 2011, pp. 294-297, 2011.

[Xin10] Y. Xing, and L. Fang, "Design-for-Test of Digitally-Assisted Analog IPs for Automotive SoCs," in IEEE Asian Test Symposium, pp. 185-191, 2010.

# RUNTIME RELIABILITY ESTIMATIONS AND SYSTEM DEPENDABILITY

ABSTRACT: In Chapter 3 it has been proposed that performance parameters can be monitored individually and can be digitally tuned back to their specifications in order to enhance the system dependability.
In this chapter it is argued that, instead of monitoring all performance parameters, only the potentially critical performance parameter(s), identified in terms of degradation effects via simulations at the design stage, need to be monitored and digitally tuned back accordingly. It will also be explained that the degradation in system-level performance parameters is directly influenced by variations and degradation in the device-level parameters. Hence the degradation in these parameters can be used as a basis for estimating reliability during the operational life of a system. These runtime (during the operational life) reliability estimations can further be used to enhance the dependability of the whole system, as proposed in the dependability workflow. Furthermore, directly interacting with the critical (internal) nodes of the system for runtime measurement of performance parameters may cause circuit overloading problems; to avoid these, an indirect technique for reliability estimation is also discussed in this chapter. Design-stage reliability simulations over a range of input-stress voltages and working-stress temperatures can be conducted for a critical performance parameter. These simulations can be used to generate the degradation profile over time, which can then be used at system level to estimate the degradation in that particular performance parameter. Hence with this technique we can indirectly estimate the reliability of a system by regularly monitoring the input-stress voltages and working-stress temperatures. Our proposed strategy is subsequently simulated for a target system in the LabVIEW environment to assess its validity.

# 4.1 INTRODUCTION

It has been discussed in the previous chapters that system dependability is becoming an important aspect of critical applications, as the technology is moving towards smaller dimensions.
Similarly, it has also been discussed that the dependability of a system can be viewed as a collection of system attributes like availability, reliability, maintainability, safety, security and survivability [Avi01]. Among these attributes, reliability can be considered the most important, because reliability estimations at the design stage are essential to guard-band the system performance safely for a certain lifetime. Reliability estimations during the operational life of a system are crucial for a dependable system design. In this case, reliability can be regularly or continuously estimated and proper actions can be anticipated in advance in order to achieve a higher availability, proper maintainability and hence better dependability of the system. Therefore, it is important to know how the reliability of a system is influenced by different physical mechanisms and how it can be monitored or estimated, especially during its operational life. This calls for a methodology or technique for monitoring the reliability during the operational life, either by directly monitoring the performance or by using indirect means; this especially holds for analog and mixed-signal systems, as discussed in this chapter.

Fabrication-related process variations, caused by deviations in the fabrication of transistors with small feature sizes, and different physical mechanisms like negative-bias temperature instability (NBTI), hot carrier injection (HCI), time-dependent dielectric breakdown (TDDB) and electromigration (EM) are the major causes affecting circuit reliability. These fabrication-related process variations [Lat11, Gao10] and the different degradation mechanisms have been discussed in the literature both separately [Mar11, Jha05] and in combination [Lu09, Ala07] to address their impact on the reliability of ICs. Among these mechanisms, NBTI is considered to be the major contributor to CMOS aging [Mar09, Sch03] and thus determines the lifetime of CMOS devices and systems.
NBTI occurs when a p-type MOS device is stressed with a negative bias voltage at elevated temperatures. Due to the NBTI effect, the threshold voltage ($V_{th}$) of the device increases over time, which in turn can result in temporal degradation of the system behaviour/performance. This degradation is strongly dependent on the duration of time for which the stressors have been applied. In addition, NBTI degradation itself depends on the initial threshold voltage ($V_{th}$) of a p-type MOS device [Kri10]. Fabrication-related process variations introduce variations in the initial threshold voltage, and these further affect the NBTI behaviour, and thus the reliability, of every device and hence of the whole system. This means the NBTI behaviour or the reliability of each device, and thus of the whole system, will be *different* and *initial-value* dependent. This highlights the importance of monitoring reliability during the operational life of a system, in addition to the usual practice of estimating reliability at the design stage.

In order to estimate system-level reliability during the operational life, system-level reliability estimation techniques need to be investigated. This involves studying the variations in system-level parameters that result from variations in device-level parameters due to process variations as well as aging effects. The goal of this chapter is to establish that system-level parameters are indeed linked to device-level parameters and that variations in system-level parameters follow the same variation/degradation trend as the device-level parameters. Furthermore, it is discussed that variations/degradations in system-level parameters might be regularly monitored to estimate the reliability of systems during their operational life. However, this can impose potential circuit overloading effects while directly interacting with the internal nodes for performance monitoring.
Therefore, an indirect technique for estimating the reliability of electronic systems that minimally interacts with the critical (internal) nodes, to overcome the potential circuit overloading effects, is also presented in this chapter. The main idea is to use design-stage performance degradation estimations for calculating a set of values that can later be used at system level as an indirect way of estimating the reliability of the overall system. These direct and indirect means of estimating performance degradation, and thereby reliability, will further play an important role in improving the dependability (reliability, maintainability, and availability) of these electronic systems.

The rest of this chapter is organized as follows. Section 4.2 describes the hierarchical flow of system specifications. Sections 4.3 - 4.6 discuss how variations and degradation in device-level parameters can affect system-level parameters, and how system-level parameters can be used to quantitatively estimate the reliability of electronic systems during their operational life. Section 4.7 explains how these runtime (during operational life) reliability estimations can further be used to enhance the dependability of a complex system; the corresponding simulation results are presented in section 4.8. Furthermore, in order to avoid the potential circuit overloading problems while directly interacting with the internal nodes of a system to measure system-level performance parameters, an indirect technique to estimate the reliability of electronic systems is presented and explained in section 4.9. The conclusions and important references are given in sections 4.10 and 4.11 respectively.

## 4.2 HIERARCHICAL FLOW OF SYSTEM SPECIFICATIONS

Figure 4.1: Hierarchical flow of system specifications.

Electronic system design typically follows a top-down hierarchical flow.
Therefore, specifications of the system are translated into specifications of its sub-systems, and this process continues until the sub-systems are well-known building blocks, such as current mirrors and differential amplifiers in the case of analog circuits, or adders, subtractors and flip-flops in the case of digital circuits. This is called the circuit level. In this hierarchical flow, the intrinsic parameters of transistors, such as dopant concentration, oxide thickness, and channel length, are at the lowest level, which is called the process level. The next level consists of transistor parameters like transconductance, output impedance or capacitances between drain, source and gate terminals. This is referred to as the device level. The flow then continues to the circuit level and up to the system level, as shown in Figure 4.1. This hierarchical flow of specifications shows that system-level parameters are indeed linked to lower-level parameters via circuit-level and device-level parameters. Therefore, any variations in device-level parameters due to process variations or aging effects can also affect the circuit-level and therefore the system-level parameters.

The next section will briefly discuss which variations can normally be expected in device-level parameters due to shrinking technology nodes, related process variations and aging effects, and how they can influence system-level parameters. Among the various device-level parameters, the main focus has been on the threshold voltage $(V_{th})$ since process variations, intra-die variations and aging effects all affect it; furthermore, the threshold voltage has been thoroughly discussed in the literature.

Figure 4.2: Typical transistor threshold voltage standard deviation ( $\sigma V_{th}$ ) normalized to the threshold voltage ( $V_{th}$ ) for several technology nodes (extracted from [Chi07]).
#### 4.3 VARIATIONS IN SYSTEM-LEVEL PARAMETERS

Variations in system-level parameters will increase as device dimensions continue to shrink, because process variations and intra-die variability give rise to variations in parameter characteristics at device level. Therefore the reliability of circuits, especially analog circuits since most analog devices rely on matched parameters, is expected to diminish. Three different cases are considered here, which show how variations in lower-level parameters affect higher-level parameters in digital as well as analog circuits.

Case 1: In this case, variations in the technology node and its intrinsic process parameters like random dopant fluctuation (RDF), line-edge roughness (LER), and oxide thickness fluctuation (OTF) and the corresponding variations in the threshold voltage ( $V_{th}$ ) are considered. For example, Figures 4.2 and 4.3(a) show respectively the variations in the threshold voltage due to shrinking technology nodes [Chi07] and intrinsic process variations [Ye10]. Similarly, Figure 4.3(b) shows the corresponding variations in the delay of an inverter [Ye10] as a function of similar variations in the threshold voltage ( $V_{th}$ ). This indicates that variations in the delay of an inverter, propagation delay being a system-level parameter, follow the same variation trend as the device-level threshold voltage ( $V_{th}$ ) parameter.

Case 2: In this case, variations in the threshold voltage ( $V_{th}$ ) due to different process corners and the corresponding variations in the output voltage, being a system-level performance parameter, of a reference voltage generator are shown in Figures 4.4(a) and 4.4(b) [Lin06] respectively.
This indicates that, like the digital circuits mentioned above, analog circuits exhibit a similar variation trend in system-level parameters as the variation trend introduced at device-level parameters due to technology-node shrinkage or process variations. Figure 4.5 further explains this connection. A single value for each device-level parameter ( $P_{D1}$ , $P_{D2}$ , $P_{D3}$ etc.) will produce a single value for each corresponding system-level parameter ( $P_{S1}$ , $P_{S2}$ , $P_{S3}$ etc.). By having a distribution for each device-level parameter a distribution for each system-level parameter will be produced.

Figure 4.3: a) Trend of ' $\sigma V_{th}$ ' in a typical transistor for different technologies as a result of different process-induced intrinsic variations like random dopant fluctuation (RDF), line-edge roughness (LER), and oxide thickness fluctuation (OTF) (extracted from [Ye10]) b) Mean ( $\mu$ ) and normalized standard deviation ( $\sigma/\mu$ ) of inverter delay under random process variations for different technologies (extracted from [Ye10]).

Figure 4.4: a) $V_{th}$ variation as a function of temperature for a typical transistor in fast (FF), typical (TT) and slow (SS) process technology corners (extracted from [Lin06]) b) Output reference voltage in different process corners using circuit proposed in [Lin06].

Case 3: In this case, bias temperature instability (BTI), which is another cause of variations in the threshold voltage ( $V_{th}$ ), has been considered. It is noted that BTI has two components: negative bias temperature instability (NBTI) which results in PMOS transistor degradation, and positive bias temperature instability (PBTI) which results in NMOS transistor degradation. The variations in threshold voltage ( $V_{th}$ ) as a result of BTI further depend on the initial threshold voltage ( $V_{th}(t_0)$ ) [Kri10].
Figures 4.6(a) and 4.6(b) show respectively the variations in the NBTI-induced threshold voltage of a PMOS transistor at three different technology corners and the corresponding variations in the propagation delay, a system-level parameter, of a five-stage ring oscillator as a function of stress time [Kri10]. These figures indicate that a similar variation trend is seen in the propagation delay of a five-stage ring oscillator as the variation trend in the threshold voltage $(V_{th})$ . This further indicates that the NBTI-induced delay degradation is initial-value dependent and is higher at a low initial $V_{th}$ value and vice versa.

Figure 4.5: Link between device-level parameters and system-level parameters a) Single value of device-level parameters will result in single value of system-level parameters b) Distribution of device-level parameters will result in distribution of system-level parameters.

Figure 4.6: a) Percentage of $V_{th}$ shift in a typical PMOS transistor for three technology corners: low, nominal, and high $V_{th}$ at 25°C and 100°C due to NBTI effect over two years (extracted from [Kri10]). b) Percentage of delay degradation of a five-stage ring oscillator for three different $V_{th}$ corners. Low $V_{th}$ circuit shows more degradation compared to high $V_{th}$ circuit (extracted from [Kri10]).

The three cases presented above show that system-level parameters are indeed linked to device-level parameters. Therefore, process variations as well as aging effects can affect device-level parameters (e.g. $V_{th}$ ), and similar variations can be expected in the system-level parameters connected to them (e.g. propagation delay in Figures 4.3(b) and 4.6(b), and reference voltage in Figure 4.4(b)). Another important finding is the dependence of the degradation rate on the initial threshold voltage ( $V_{th}(t_0)$ ) values (Figure 4.6(b)).
This shows that two identical circuits will degrade differently, with different degradation rates and behaviours, as they have different $V_{th}$ values at the start as a result of process variations. Large $V_{th}$ variations have been observed as technology nodes are shrinking towards 22nm as a result of their corresponding process variations [Chi07, Ye10, Lin06]. Therefore, these large initial $V_{th}$ variations could result in large initial variations in system-level parameters which will further degrade differently [Kri10] over time, resulting in further variations as a function of time (Figures 4.6(b), 4.7, and 4.8).

Figure 4.7: Temporal movement of the mean and the spread of standard deviation of the output swing voltage of 300 LC-VCO circuits due to aging (extracted from [Mar09]) at different points in time ( $t_0 < t_1 < t_3 < t_4$ ).

Figure 4.8: Temporal movement of the mean and the shrinkage of standard deviation of the delay of 'N' number of ring oscillators due to aging (extracted from [Wan08]) at different points in time ( $t_0 < t_1 < t_3 < t_4$ ).

## 4.3.1 PARAMETER VARIATIONS VS TEMPORAL DEGRADATIONS

Figure 4.7 shows the probability density function (PDF) of the output voltage swing of 300 LC-VCO (Voltage Controlled Oscillator) samples subject to process variability as a function of temporal degradations; the data has been extracted from [Mar09]. At time $t_0$ the spread of output voltage swing is due to the process variations at the start. Furthermore, this figure shows that due to temporal aging degradations the mean of the output voltage swing is decreasing while the dispersion of the output voltage swing is increasing. This means that the LC-VCO samples having an initial output voltage swing less than the mean value have degraded faster than the LC-VCO samples having an initial output voltage swing higher than the mean value.
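The link between an initial-value-dependent degradation rate and a widening distribution can be illustrated with a small Python sketch. The linear drift model, the function name `degrade_population`, and all numbers are assumptions chosen for illustration; they are not taken from [Mar09].

```python
import statistics


def degrade_population(initial_values, base_rate, sensitivity, t):
    """Toy model (an assumption): every sample loses value linearly in time,
    and samples that start below the population mean lose it faster."""
    mean0 = statistics.fmean(initial_values)
    aged = []
    for p0 in initial_values:
        # The rate grows with the relative distance below the initial mean.
        rate = base_rate * (1.0 + sensitivity * (mean0 - p0) / mean0)
        aged.append(p0 - rate * t)
    return aged


# Hypothetical output swings (in volts) of five samples at t0.
swings_t0 = [0.90, 0.95, 1.00, 1.05, 1.10]
swings_t1 = degrade_population(swings_t0, base_rate=0.01, sensitivity=5.0, t=4.0)

print("mean:", statistics.fmean(swings_t0), "->", statistics.fmean(swings_t1))
print("std: ", statistics.pstdev(swings_t0), "->", statistics.pstdev(swings_t1))
```

In this toy model the mean moves down while the standard deviation grows, which is the qualitative behaviour described for Figure 4.7. Reversing the sign of the initial-value dependence would make the spread shrink instead.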
Similarly, Figure 4.8 shows the probability density function (PDF) of the delay of 'N' number of ring oscillators subject to process variations as a function of temporal degradations; this data has been extracted from [Wan08]. This figure shows that the mean delay of the 'N' ring oscillators is increasing whereas the dispersion of their delay is decreasing. This means that the ring oscillators having an initial delay, as a result of process variations, less than the mean value have degraded faster than the ring oscillators having an initial delay higher than the mean value.

The above findings give us the important insight that the system-level performance parameters of *identical* circuits will exhibit different degradation behaviour depending on the nature of the performance parameter and its associated circuit. For example, in the above case, the output voltage swing of an LC-VCO is decreasing with time, while the delay of a ring oscillator is increasing with time. Furthermore, depending on the initial value of the performance parameter, the degradation rate of that performance parameter for *identical* circuits is different. This means that among *identical* circuits, some circuits will have a faster degradation rate while others will have a slower degradation rate. This relationship between device-level and system-level parameters and their temporal-degradation dependence on initial values demands a runtime (during operational life) strategy to tackle their dependability issues, which is discussed in the next sections.

# 4.4 RUNTIME RELIABILITY REQUIREMENTS

As discussed in the previous chapter, the reliability of a system at a particular time gives an estimate of the probability that the system works correctly, at that time, for the purpose it has been designed for.
In other words, it can be defined as the probability that the performance parameters will remain within the designed specifications of the system at that time. This provides a solid foundation that any anticipated change from the designed specifications of system-level performance parameters will provide an estimate of the reliability. These changes could be a result of a mix of process variations and/or aging effects as discussed above. Therefore, monitoring variations in system-level performance parameters during the operational life and a comparison with the designed system specifications will provide the basis for an estimate of the process variations and the aging effects respectively and hence a technique to estimate the reliability of the system. Initial $(t = t_0)$ deviations of performance parameters from the design specifications can give an estimate of deviations as a result of process variations. Whereas, temporal $(t > t_0)$ deviations of performance parameters from the designed specifications can give an estimate of aging effects of the system. Similarly, in the previous sections it has been discussed that aging-related degradation mechanisms will have the same effect on the performance parameters of identical systems but the initial variation in these performance parameters due to fabrication-related process variations will determine the overall lifetime of each system. In order to further understand this let there be 'N' number of identical systems having the density function (DF) of an arbitrary performance parameter 'P' as shown in Figure 4.9. Depending on the system properties, this parameter DF can move either to the left hand side or to the right hand side (i.e can decrease or increase) due to aging effects as discussed in the previous section. The three regions, referred to as "allowed", "not allowed", and "permanent failure", provide information about the boundaries of the specification and are application dependent. 
In our case, three scenarios have been considered as shown in Figure 4.10.

Figure 4.9: Temporal movement of parameter Density Function (DF) of an arbitrary performance parameter 'P' of 'N' identical systems under uniform degradation due to aging effects.

Figure 4.10: Lifetime of three identical systems having a different initial value for an arbitrary performance parameter 'P'.

**Scenario 1:** The system has a performance parameter 'P' which is near to the expected mean value.

**Scenario 2:** The system has a performance parameter 'P' that is in-between the mean value and the "allowed" boundary.

**Scenario 3:** The system has a performance parameter 'P' that is far from the mean value and very close to the "allowed" boundary.

All three scenarios have the original performance parameter 'P' within the acceptable range of specifications for an arbitrary application. Two cases are possible here, as discussed below.

Case 1: In this case, it is assumed that the temporal degradation in all 'N' identical systems is independent of their initial performance parameter values and is in the right-hand side direction. The system in the first scenario, having the largest margin before it goes beyond the allowed upper boundary, then has the longest lifetime. The systems in the second and the third scenario, being closer to the allowed upper boundary, will have shorter lifetimes than the first system. In other words, the system in the third scenario will fail sooner and have a shorter operational life than the systems in the other two scenarios.

Figure 4.11: Temporal movement of parameter Density Function (DF) of an arbitrary performance parameter 'P' of 'N' identical systems under non-uniform degradation due to aging effects (standard deviation is increasing).
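The Case 1 argument above (uniform drift, so lifetime is set purely by the initial margin) can be checked with a short Python sketch. The boundary, the drift rate, the three initial values, and the helper name `lifetime_uniform` are hypothetical and chosen only for illustration.

```python
def lifetime_uniform(p_initial, p_max, rate):
    """Time until a parameter drifting upward at a constant rate reaches p_max."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    # A system that already sits at or beyond the boundary has no lifetime left.
    return max(p_max - p_initial, 0.0) / rate


P_MAX = 110.0  # hypothetical upper "allowed" boundary (a.u.)
RATE = 0.01    # hypothetical drift in a.u. per hour, identical for all systems

# Hypothetical initial values: near the mean, in between, close to the boundary.
scenarios = {"scenario 1": 100.0, "scenario 2": 105.0, "scenario 3": 109.0}
lifetimes = {name: lifetime_uniform(p0, P_MAX, RATE) for name, p0 in scenarios.items()}

for name, hours in lifetimes.items():
    print(f"{name}: about {hours:.0f} h until the boundary is reached")
```

Under these assumptions the ordering of the lifetimes follows the ordering of the initial margins, which is the statement made for Case 1. It no longer holds once the rate depends on the initial value.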
Figure 4.12: Temporal movement of parameter Density Function (DF) of an arbitrary performance parameter 'P' of 'N' number of identical systems under non-uniform degradation due to aging effects (standard deviation is decreasing).

Case 2: In this case, the degradation rate of 'N' number of identical systems is not uniform and it depends on the initial values as mentioned in the previous section. Similarly, the architecture of the system will determine which performance parameter will move to the right-hand side or to the left-hand side (i.e. will increase or decrease). This means there will be four possible sub-cases as shown in Figures 4.11 and 4.12. That is, for an arbitrary performance parameter 'P' of 'N' identical systems:

**case 2a:** the dispersion of the parameter Density Function (DF) is *increasing* and the mean is moving to the *right*-hand side (i.e. increasing) as shown in Figure 4.11 (*right*-hand side of mean value).

**case 2b:** the dispersion of the parameter Density Function (DF) is *increasing* and the mean is moving to the *left*-hand side (i.e. decreasing) as shown in Figure 4.11 (*left*-hand side of mean value).

**case 2c:** the dispersion of the parameter Density Function (DF) is *decreasing* and the mean is moving to the *right*-hand side (i.e. increasing) as shown in Figure 4.12 (*right*-hand side of mean value).

**case 2d:** the dispersion of the parameter Density Function (DF) is *decreasing* and the mean is moving to the *left*-hand side (i.e. decreasing) as shown in Figure 4.12 (*left*-hand side of mean value).

By considering these non-uniform degradation rates, the behaviour of an arbitrary performance parameter 'P' for the above mentioned three scenarios can become unpredictable. For example, in case the arbitrary performance parameter 'P' follows "case 2a" (mentioned above) then 'P' in case of scenario 3 moves beyond the allowed boundaries before the 'P' in case of scenario 1.
However, in case the arbitrary performance parameter 'P' follows "case 2b" (mentioned above) then 'P' in case of scenario 1 moves beyond the allowed boundaries before the 'P' in case of scenario 3. Similarly, if we assume two almost identical systems that have different architectures, for example two sound systems that differ only in their amplifier, the arbitrary performance parameter 'P' can degrade differently although it has similar initial values in both amplifiers. This shows that the degradation behaviour of an arbitrary performance parameter 'P' is a complex function of the initial values as well as the architecture of the system. It becomes even more complex if system-level performance parameters change their degradation behaviour during aging or operational life. Therefore, estimating the runtime (during operational life) reliability of a system based on the degradation of its system-level parameters will require continuous monitoring and a small database (memory) of designed specifications as well as initial values of system-level parameters, as discussed in the next sections.

# 4.5 CRITICAL PERFORMANCE PARAMETERS

Until now it has been established that system-level performance parameters can possibly be used to estimate the runtime reliability of a system. The next question is which system-level performance parameters can provide the best reliability estimations. There are two possible approaches to this question:

1) In the first approach, all of the system-level performance parameters essential for a particular application can be monitored and the reliability can be estimated based on the information obtained from these performance parameters.
2) In the second approach, only the most critical system-level performance parameter in terms of aging effects can be monitored and the reliability can be estimated based on the information obtained from this single performance parameter.
The first approach requires multiple performance monitoring circuits and most likely a complex mechanism to extract the correct reliability information. On the other hand, the second approach only requires a single performance monitoring circuit and probably a simple mechanism to extract the correct reliability information. Furthermore, the selection of the most critical system-level performance parameter is application dependent, and the candidate parameters can be divided into different categories based on their sensitivity to aging effects. The most sensitive system-level performance parameter can be identified via aging simulations at design stage and can be selected as the best indicator for reliability estimations. The second approach has been preferred here over the first approach to avoid complexity. The mathematical formulation of this approach is given in the next section.

Figure 4.13: Linear degradation (increasing or decreasing) of a performance parameter 'P' during time interval ' $t - t_0$ ' and associated iTTF.

# 4.6 QUANTITATIVE RUNTIME RELIABILITY ESTIMATION

Let a system-level performance parameter 'P' be the critical performance parameter, being the most sensitive to aging effects. Similarly, let $P_{\min}$ and $P_{\max}$ represent the designed functional specification boundaries for the performance parameter 'P' (i.e. $P_{\min} \leq P \leq P_{\max}$ ). Then at any point in time 't', the time it will take for 'P' to move beyond its specifications (regarded here as a functional failure), i.e. the instantaneous-time-to-failure (iTTF) as introduced in Chapter 2, can be regarded as a quantitative means of estimating the reliability at that particular time. The larger the remaining time before the system fails, the higher its reliability will be, and vice versa. Therefore, the iTTF has been used as a reliability estimator in this research. Suppose at time 't', the performance parameter 'P' has been degraded from $P(t_0)$ (initial value of 'P' at time 't = 0' or 't<sub>0</sub>') to P(t).
Assuming a linear degradation (increase or decrease of 'P', Figure 4.13), the time it will take to change 'P' beyond its specifications is called its reliability $R(t) \cong iTTF(t)$ at time 't'. It will be given by:

$$R(t) \cong iTTF(t) = \left[\frac{P_{max} - P(t)}{P(t) - P(t_0)}\right] * t \tag{4.1}$$

in case 'P' has increased during the time interval ' $t - t_0$ ', provided that $P(t) \neq P(t_0)$ , as shown in Figure 4.13(a). Similarly, it will be given by:

$$R(t) \cong iTTF(t) = \left[\frac{P(t) - P_{min}}{P(t) - P(t_0)}\right] * t \tag{4.2}$$

in case 'P' has decreased during the time interval ' $t - t_0$ ', provided that $P(t) \neq P(t_0)$ , as shown in Figure 4.13(b).

As stated before, system-level parameters (e.g. propagation delay in Figures 4.3(b) and 4.6(b), and reference voltage in Figure 4.4(b)) are not only connected to device-level parameters (e.g. $V_{th}$ ) but also their degradation rate is dependent on the initial threshold voltage ( $V_{th}(t_0)$ ) values (Figure 4.6(b)). Similarly, from Figures 4.2, 4.3(a), 4.4(a), and 4.6(a) it can be concluded that the initial value and time dependent behaviour of $V_{th}$ is [Mar09]:

$$V_{th}(t) = V_{th}(t_0) + At^n \tag{4.3}$$

and

$$A = f(V_{DS}, V_{GS}, V_{th}(t_0), T, W, L, RDF, OTF, LER, FF, TT, SS) \tag{4.4}$$

Here, $V_{th}(t_0)$ is the initial threshold voltage (for an unstressed device), 't' is the time and 'n' is a degradation parameter (about 0.18 for NBTI [Mar09]). 'A' is a function of geometrical (e.g. length 'L' and width 'W'), environmental (e.g. temperature 'T'), and process-related (e.g. random dopant fluctuation RDF) transistor parameters.
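As a rough numerical illustration of equation (4.3), the following Python sketch evaluates the power-law shift for a fixed prefactor. The value of `a`, the initial threshold voltage, the time unit, and the function name `vth_shift` are assumptions made for the example. In the text 'A' depends on many parameters through equation (4.4), which is not modelled here.

```python
def vth_shift(vth0, a, t, n=0.18):
    """Evaluate V_th(t) = V_th(t0) + A * t**n from equation (4.3).

    vth0 and a are hypothetical inputs; n defaults to the NBTI value
    of about 0.18 quoted in the text.
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    return vth0 + a * t ** n


VTH0 = 0.30  # hypothetical initial threshold voltage magnitude (V)
A = 0.005    # hypothetical prefactor; really a function of bias, T, W, L, ...

for hours in (1_000, 2_000, 4_000):
    print(hours, "h ->", round(vth_shift(VTH0, A, hours), 4), "V")
```

Because the exponent is well below one, doubling the stress time multiplies the shift by only $2^{0.18}$ (roughly 1.13), so most of the shift accumulates early in the operational life.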
Therefore, the performance parameter 'P' will also be a function of $V_{th}$ (its initial and time dependent values as explained above) and the reliability R(t) of the system defined in equations (4.1) and (4.2) can be rewritten as:

$$R(t, V_{th}(t)) \cong iTTF(t, V_{th}(t)) = \left[\frac{P_{max} - P(t, V_{th}(t))}{P(t, V_{th}(t)) - P(t_0, V_{th}(t_0))}\right] * t \tag{4.5}$$

in case 'P' has increased during the time interval ' $t - t_0$ ', provided that $P(t, V_{th}(t)) \neq P(t_0, V_{th}(t_0))$ . In case 'P' has decreased during the time interval ' $t - t_0$ ', provided that $P(t, V_{th}(t)) \neq P(t_0, V_{th}(t_0))$ , then one obtains:

$$R(t, V_{th}(t)) \cong iTTF(t, V_{th}(t)) = \left[\frac{P(t, V_{th}(t)) - P_{min}}{P(t, V_{th}(t)) - P(t_0, V_{th}(t_0))}\right] * t \tag{4.6}$$

The above equations assume that there is a linear degradation during the time interval ' $t-t_0$ '. In case there are non-linear degradations, one can divide the time into 'n' (equal or non-equal) time points ( $t_0, t_1, t_2, \ldots, t_{(n-1)}, t_n$ ), chosen such that during each time interval ( $t_1-t_0, t_2-t_1, \ldots, t_n-t_{n-1}$ ) the degradation rate remains nearly constant. Therefore, equations (4.5) and (4.6) can be rewritten as:

$$R(t_{n}, V_{th}(t_{n})) \cong iTTF(t_{n}, V_{th}(t_{n})) = \left[\frac{P_{max} - P(t_{n}, V_{th}(t_{n}))}{P(t_{n}, V_{th}(t_{n})) - P(t_{n-1}, V_{th}(t_{n-1}))}\right] * t_{n} \tag{4.7}$$

in case 'P' has increased during the time interval ' $t_n - t_{n-1}$ ', provided that $P(t_n, V_{th}(t_n)) \neq P(t_{n-1}, V_{th}(t_{n-1}))$ .
In case 'P' has decreased during the time interval ' $t_n - t_{n-1}$ ', provided that $P(t_n, V_{th}(t_n)) \neq P(t_{n-1}, V_{th}(t_{n-1}))$ , then it will be given by:

$$R(t_n, V_{th}(t_n)) \cong iTTF(t_n, V_{th}(t_n)) = \left[\frac{P(t_n, V_{th}(t_n)) - P_{min}}{P(t_n, V_{th}(t_n)) - P(t_{n-1}, V_{th}(t_{n-1}))}\right] * t_n \tag{4.8}$$

Equations (4.7) and (4.8) are further used in section 4.8 where a target system is simulated for the presented workflow in order to enhance the system dependability.

Figure 4.14: Workflow of the proposed approach for estimating the reliability of a system and taking proper actions for enhancing dependability.

# 4.7 PROPOSED DEPENDABILITY WORKFLOW

The above discussion provides the important motivation that, for correctly estimating the reliability of a system based on the degradation of its system-level performance parameters, it will be necessary to regularly monitor and store system-level parameters as well as initial values of specifications in a database. The initial values at the start will provide information about the process variations whereas the gradual degradation over time will provide information about the aging effects. Keeping this in mind, the suggested workflow shown in Figure 4.14 for estimating the reliability during the operational life and enhancing system dependability consists of a small database (memory) of design-level specifications (e.g. $P_{min}$ , $P_{max}$ , C, $iTTF_{min}$ and MTBF) for a particular application and for runtime logged values of critical performance parameter(s). Further included are the performance monitoring circuit(s) for the critical performance parameter(s) (e.g. 'P'). The database of system specifications (parameters) along with measurements of system-level parameters during its operational life will be further used to estimate the reliability $R(t) \cong iTTF(t)$ using equations (4.7) and (4.8).
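The estimate of equations (4.7) and (4.8) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the thesis: the function name `ittf`, the handling of the no-drift and out-of-specification cases, and the use of the magnitude of the observed change (so that the result is positive for a decreasing 'P') are choices made here. The equations as printed multiply by $t_n$; the sketch takes the time factor as an explicit argument `elapsed`, so a caller can pass $t_n$ to follow the printed form or $t_n - t_{n-1}$ for a slope-based extrapolation.

```python
def ittf(p_now, p_prev, p_min, p_max, elapsed):
    """Linear estimate of the time left until 'P' leaves [p_min, p_max].

    p_now, p_prev: the two most recent logged values of 'P'.
    elapsed: time factor of equations (4.7)/(4.8), supplied by the caller.
    """
    if not p_min <= p_now <= p_max:
        return 0.0  # assumption: already out of specification, no time left
    delta = p_now - p_prev
    if delta == 0:
        # The equations exclude this case; treated here as "no failure in sight".
        return float("inf")
    if delta > 0:
        margin = p_max - p_now  # increasing 'P', equation (4.7)
    else:
        margin = p_now - p_min  # decreasing 'P', equation (4.8)
    return margin / abs(delta) * elapsed


# Hypothetical readings with boundaries [90, 110] a.u.
print(ittf(105.0, 100.0, 90.0, 110.0, elapsed=1500.0))  # 'P' drifting upward
print(ittf(95.0, 100.0, 90.0, 110.0, elapsed=1500.0))   # 'P' drifting downward
```

In both example calls 'P' has consumed half of its margin during `elapsed`, so the sketch estimates that the same amount of time remains.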
# 4.7.1 WORKING PRINCIPLE

At the start, the system will first be put into a test mode to acquire the initial values of its performance parameter(s) (e.g. $P(t_0, V_{th}(t_0))$ , 2 in Figure 4.14), as a result of process variations, using performance monitoring circuit(s) (Figure 4.16). The next time point ' $t_1$ ', when the system will be put into test mode again, will be estimated based on the distance it (i.e. $P(t_0, V_{th}(t_0))$ ) has from the specification boundaries (i.e. $P_{min}$ and $P_{max}$ ) and stored values (i.e. C and MTBF, 8 in Figure 4.14). This can be expressed as:

$$t_1 = \left[\frac{P_{max} - P_{min}}{C * P(t_0, V_{th}(t_0))}\right] * MTBF \tag{4.9}$$

where 'C' is a constant determined from the design-stage reliability simulations for performance parameter 'P' and can be adjusted in such a way that the degradation in 'P' remains nearly constant during each time interval ' $t_n - t_{n-1}$ '.

Figure 4.15: a) Linear degradation of the performance parameter 'P' as a function of iTTF, b) Linear degradation of the performance parameter 'P' vs the time interval for the next data acquisition (equations (4.9) and (4.10)).

At a point in time ' $t_1$ ', the system will be put again in test mode to acquire new values of the performance parameter, as a result of the aging effects, and after that the normal operation will be resumed. These values are subsequently stored in the database along with a time stamp (5 in Figure 4.14). The reliability $(R(t_1, V_{th}(t_1)) \cong iTTF(t_1, V_{th}(t_1)))$ will be estimated using equations (4.7) and (4.8) at this calculated point in time ' $t_1$ ' by having newly acquired values $(P(t_1, V_{th}(t_1)))$ of the performance parameter 'P' and already saved values $(P(t_0, V_{th}(t_0)))$ at time ' $t_0$ ' in the database (memory, 8 in Figure 4.14).
In the case that $iTTF_{min}$ represents the minimum value to be monitored for taking repair actions, a decision about digital tuning or replacement can be taken if the estimated reliability $(R(t_1, V_{th}(t_1)) \cong iTTF(t_1, V_{th}(t_1)))$ is less than or equal to $iTTF_{min}$ (i.e. $iTTF(t_1, V_{th}(t_1)) \leq iTTF_{min}$ , 4 in Figure 4.14). The point in time for the next acquisition will be calculated based on this new reliability information. That is:

$$t_{2} = \left[\frac{P_{max} - P_{min}}{C * P(t_{1}, V_{th}(t_{1}))}\right] * iTTF(t_{1}, V_{th}(t_{1})) \tag{4.10}$$

At this new point in time the reliability $(R(t_2, V_{th}(t_2)) \cong iTTF(t_2, V_{th}(t_2)))$ and the next point in time ' $t_3$ ' will be calculated. In this way, this process will continue during the operational life of the system.

Figure 4.15 shows the above mentioned calculations (equations (4.9) and (4.10)) for a system where linear degradations have been considered for the system-level performance parameter 'P'. The values used for $P_{min}$ , $P_{max}$ , C, and MTBF are given in Table 4.1. The initial value for the performance parameter 'P' has been assumed to be 100. These calculations show that in 3000 hours the performance parameter 'P' has degraded from its initial value of 100 to 110, which is the maximum upper limit for 'P'. However, the time interval after which the new values of the performance parameter 'P' are acquired has constantly reduced from 0.9 hours to 0.004 hours (~14 sec). The first of these intervals follows directly from equation (4.9) with the values of Table 4.1: $t_1 = [(110 - 90)/(667 * 100)] * 3000 \approx 0.9$ hours. This is in accordance with the fact that as the performance parameter 'P' approaches its specification boundaries (i.e. $P_{min}$ , $P_{max}$ ) it has to be monitored more often in order to avoid failures. The next section will discuss how these calculations play an important role in improving the system dependability.

# 4.7.2 DEPENDABILITY IMPROVEMENTS

In the presented dependability workflow, estimating reliability quantitatively has multiple advantages.
On one side, it gives the direct means of estimating the reliability of a system at any time during its operational life. While on the other side, by having this value, proper precautionary actions (like deciding possible repair actions) to prevent failure or minimizing dangers to the environment can be anticipated in advance. In the case, digital repair options are available for the sub-systems of a larger system then reliability estimations can be used for the dependability improvements of the system as well. For example, under critical situations, i.e. if the estimated iTTF is less than or equal to $iTTF_{min}$ (i.e. $iTTF(t_n, V_{th}(t_n)) \leq iTTF_{min}$ , the minimum time before which a repair action should be taken in order to avoid failure), digital tuning or replacement actions of its sub-IPs could be taken to remain within specification limits of the performance parameters (e.g. $P_{min}$ , $P_{max}$ ) and hence are the means to improve the system *reliability*. In practice, the actual replacement of one sub-IP with another redundant sub-IP can be achieved by using electronic switches as discussed in Chapter 3. Anticipating digital tuning and replacement actions in advance based on the regular reliability estimations will reduce the mean-time-to-repair (MTTR). In other words the **maintainability** of the system can be increased. This means the system will know in advance at what point in time it has to take digital repair actions. Theoretically, by anticipating repair actions in advance, the repair time can be reduced to near zero. Furthermore, the availability of the system at any time ' $t_n$ ' will be defined by (equation (3.1)): $$A(t_n, V_{th}(t_n)) = \frac{iTTF(t_n, V_{th}(t_n))}{iTTF(t_n, V_{th}(t_n)) + MTTR(t_n)}$$ $$\tag{4.11}$$ The above equation shows that by reducing MTTR near to zero the *availability* $A(t_n, V_{th}(t_n))$ of the system can be increased to 100% [Kha11]. 
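Equation (4.11) is simple enough to check numerically. The Python sketch below is only an illustration; the function name `availability` and the example values of iTTF and MTTR are hypothetical.

```python
def availability(ittf_hours, mttr_hours):
    """Availability per equation (4.11): iTTF / (iTTF + MTTR)."""
    if ittf_hours < 0 or mttr_hours < 0:
        raise ValueError("times must be non-negative")
    total = ittf_hours + mttr_hours
    if total == 0:
        raise ValueError("iTTF and MTTR cannot both be zero")
    return ittf_hours / total


# Hypothetical iTTF of 1000 h with progressively shorter repair times.
for mttr in (10.0, 1.0, 0.0):
    print(f"MTTR = {mttr:4.1f} h -> A = {availability(1000.0, mttr):.4f}")
```

As the repair time shrinks towards zero the ratio approaches one, which is the limit referred to in the text.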
Therefore, estimating the reliability of the system during its operational life, so that it remains within its specification boundaries, will ultimately increase its reliability, maintainability, availability and hence the *dependability* of the whole system.

# 4.8 SIMULATIONS AND RESULTS

In order to investigate the proposed idea of estimating the reliability of an electronic system during its operational life under different linear and non-linear degradation behaviours, a simulation setup was constructed in the LabVIEW environment. Its results in improving the system dependability are discussed in the following sections. Sections 4.8.1 and 4.8.2 present the target system and the different aging degradation behaviours used for the performance parameter 'P' respectively. The simulator GUI, its different parts and simulation results are discussed in sections 4.8.3, 4.8.4, and 4.8.5 respectively. Important overheads as a result of the presented technique are presented in section 4.8.6.

Figure 4.16: An example system consisting of digitally tuneable redundant sub-IPs (SIP) for running simulations using the proposed workflow of Figure 4.14.

Table 4.1: Necessary details of $SIP_{1(A,B,C)}$ used in the LabVIEW simulations.

| Parameter | Value |
|------------------------------------------------|--------------------------|
| Mean value of the designed parameter 'P' | 100 a.u. |
| Allowed boundaries of the parameter 'P' | [90 110] a.u. |
| Number of redundant IPs available | 3 |
| Digital tuning options for the parameter 'P' | 8 (3 digital bits) |
| Digitally tuning range for each digital option | 2% of initial value of P |
| Maximum MTBF for each SB1<sub>(A,B,C)</sub> | 3000 hr (for example) |
| Constant 'C' value | 667 |

## 4.8.1 SIMULATION SETUP

A general target system consisting of sub-IPs $SIP_1$ , $SIP_2$ , $SIP_3$ and $SIP_4$ has been considered in order to investigate the proposed idea as shown in Figure 4.16. Each sub-IP further consists of three redundant sub-IPs.
For example, $SIP_1$ consists of $SIP_{1A}$, $SIP_{1B}$, and $SIP_{1C}$. Each one of these sub-IPs has three digital tuning knobs, i.e. three digital bits (Table 4.1), so each sub-IP has eight possible digitally tuneable options. The performance monitoring circuits are responsible for monitoring the most sensitive performance parameters for estimating the reliability of each individual sub-IP and hence of the whole system, as shown by the vertical dotted lines in Figure 4.16. The "Decision Making, Tuning and Replacement Circuitry" will gather information from the "Performance Monitoring Circuits" and from the "Database of System Specifications". Based on this information, the "Decision Making, Tuning and Replacement Circuitry" will take the necessary digital tuning or replacement actions. For simplicity, only a single sub-IP with all its redundant sub-IPs, i.e. $SIP_{1(A,B,C)}$, has been assumed for the simulations. A single arbitrary performance parameter 'P' has been considered as the potential critical performance parameter with respect to aging effects and process variations. Table 4.1 shows the necessary details of this sub-IP ($SIP_{1(A,B,C)}$).

## 4.8.2 SIMULATION OF DEGRADATION BEHAVIOURS

In order to simulate a variety of linear and non-linear aging degradation behaviours for the performance parameter 'P', four different degradation behaviours have been assumed as discussed below. In these degradation behaviours, 'P' may or may not be a function of the initial values due to fabrication-related process variations.

**Logarithmic Degradation:** The logarithmic degradation behaviour represents the degradation behaviour which is purely based on a constant logarithmic behaviour and is independent of the initial value of 'P'. This means that for every initial value of 'P' the degradation will follow the same constant logarithmic behaviour.
**Proportional Logarithmic Degradation:** The proportional logarithmic degradation behaviour represents the degradation behaviour in which the logarithmic behaviour of performance parameter 'P' depends on its initial value. Initial values of 'P' close to '$P_{min}$' will result in a slow logarithmic degradation rate, whereas initial values of 'P' close to '$P_{max}$' will result in a fast logarithmic degradation rate.

**Random Logarithmic Degradation:** The random logarithmic degradation represents the degradation behaviour which has a random (slow or fast) logarithmic behaviour and is independent of the initial value of performance parameter 'P'. This means that the degradation behaviour of the performance parameter will be logarithmic. However, the degradation rate, i.e. how fast or slow it will degrade over time, will be completely random.

**Constant Linear Degradation:** The constant linear degradation behaviour represents the degradation behaviour which has a constant linear behaviour and is independent of the initial value of performance parameter 'P'. This means that for every initial value of 'P' the degradation will follow the same constant linear behaviour.

## 4.8.3 THE SIMULATOR GUI

In order to simulate the presented target system, a simulator GUI has been constructed in the LabVIEW environment as shown in Figure 4.17. This GUI has a flexible architecture where different input parameters can be selected in order to investigate different degradation behaviours and the corresponding *TBF* and total lifetime of the target system. The "Behaviour" tab controls the overall increasing or decreasing behaviour of the performance parameter 'P'. The "Simulation Speed" parameter is used to control the simulation speed for visibility and understanding purposes. The "No. of Redundant Blocks" control sets the available number of redundant IPs for each main IP of the system.
The "Digital Tuning Options" control provides the available number of digital tuning options for each IP in the system. Similarly, the "Maximum MTBF" and "Designed Parameter P Value" controls provide the design-level MTBF and performance parameter 'P' values. Furthermore, the allowed variation in performance parameter 'P' for a redundant IP and for each digital tuning option can be selected by using "% Allowed Variation in P" (outer horizontal lines explained below) and "% Digital Tuning Variation in P" (inner horizontal lines explained below) respectively.

Figure 4.17: Simulation results of a system consisting of three redundant sub-IPs ($SIP_{1(A,B,C)}$). Each sub-IP $SIP_{1(A,B,C)}$ has eight digitally tuneable options for performance parameter 'P'. Four different aging degradation possibilities for parameter 'P' have been considered and simulated for the same system with the same initial/start values of 'P'. A failure is defined as the case in which the parameter 'P' goes beyond its defined boundaries (outer horizontal lines for $P_{min}$ and $P_{max}$).

In Figure 4.17, there are four types of lines available for each degradation behaviour as explained below:

**Outer Horizontal Lines:** The outer horizontal lines ($\alpha$) indicated in Figure 4.17 show the allowed boundaries of the performance parameter 'P' (i.e. $P_{min}$ and $P_{max}$). If the performance parameter 'P' lies within these boundaries, the system is considered to be functioning correctly. If the performance parameter goes beyond these boundaries, the system will start to function incorrectly. In the present case, the boundaries have been set to $\pm 10\%$ of the designed value of performance parameter 'P'. This means that every IP or sub-IP must have the performance parameter 'P' within the limits of $P_{min}$ and $P_{max}$; if not, it will be functioning incorrectly.
**Inner Horizontal Lines:** The inner horizontal lines ($\beta$) show the range of possible initial values of the performance parameter 'P'. These values correspond to the different available digital tuning options. In the present case, they have been considered to lie within 2% of the initial value of the performance parameter 'P' (Figure 4.17). This means that every new value of the performance parameter 'P' produced as a result of a digital tuning option will remain within 2% of the start value of the performance parameter 'P' at time '$t_0$'.

**Complete Vertical Lines:** The complete vertical lines ($\gamma$) indicated in Figure 4.17 show the points in time when the performance parameter 'P' is about to cross the allowed boundaries of specification (i.e. $P_{min}$ or $P_{max}$). At these points in time the digital tuning options have been used for tuning the performance parameter 'P' back within its allowed boundaries.

**Dotted Vertical Lines:** The bold-dotted vertical lines ($\delta$) show the points in time where $SIP_{1A}$ has been replaced with a redundant sub-IP (i.e. by $SIP_{1B}$ or $SIP_{1C}$). Therefore, they show the points in time when a complete IP has been replaced with another redundant sub-IP (Figure 4.17).

The simulation results of the target system will be discussed in detail in section 4.8.5.

## 4.8.4 RANDOMLY SELECTED VALUES

The initial or start value of 'P' of each redundant sub-IP has been randomly selected from the allowed range for the performance parameter 'P' (i.e. in between the outer horizontal lines for $P_{max}$ and $P_{min}$). The start value for each sub-IP after each possible digital tuning option, used to tune the performance parameter 'P' back within its allowed boundaries, has been randomly selected from the digitally tuneable range of that sub-IP, that is, in between the inner horizontal lines having values within 2% of the initial value of the performance parameter 'P'.
The purpose of randomly selecting the values of performance parameter 'P' is to show the possible initial value variations as a result of fabrication-related process variations.

The simulation results of $SIP_{1(A,B,C)}$, obtained by assuming the different degradation behaviours mentioned above for its performance parameter 'P', and the corresponding numerical values extracted from these simulation results are shown in Figure 4.17 and Table 4.2 respectively. The simulation starts with the redundant sub-IP $SIP_{1A}$. The performance monitoring circuit regularly (equations (4.9) and (4.10)) monitors the performance parameter 'P' during the regular test mode operation and communicates with the database for storing the performance parameter 'P' values. These values are then used to estimate its reliability using equations (4.7) and (4.8). This estimate is further used by the "Decision Making, Tuning and Replacement Circuitry" (Figure 4.16) for taking the necessary digital tuning or replacement actions at the right time (i.e. if $R(t) \cong iTTF(t) \leq iTTF_{min}$), before the performance parameter 'P' moves beyond its defined specifications (i.e. beyond $P_{min}$ and $P_{max}$). Initially, a sub-IP (e.g. $SIP_{1A}$) will be digitally tuned via its digital knobs to remain within the defined specification boundaries (i.e. $P_{min}$ and $P_{max}$); in case the required tuning range exceeds the digital tuning options available for this sub-IP ($SIP_{1A}$), it will be replaced with a redundant sub-IP (e.g. $SIP_{1A}$ with $SIP_{1B}$).

These simulation results and the corresponding numerical values provide important information: for different linear and non-linear degradation behaviours of the performance parameter 'P', with the same initial value (i.e. the starting value of 'P' is the same for each degradation behaviour as shown in Table 4.2), the operational lifetime of each system is different.
The system in which all the redundant IPs have a constant linear degradation behaviour for the performance parameter 'P' (Table 4.2) shows the maximum lifetime (i.e. 43076 hours), whereas the system in which all the redundant IPs have the random logarithmic degradation behaviour for the performance parameter 'P' (Table 4.2) exhibits the minimum lifetime (i.e. 5736 hours). It is also obvious from Table 4.2 that different initial values of performance parameter 'P', with different degradation behaviours, will give completely different behaviours of the same system. It therefore becomes quite complicated to decide at which time one has to start digitally tuning or replacing the sub-IP for enhancing its reliability or availability. For example, for the sub-IP $SIP_{1A}$ with a constant logarithmic degradation behaviour for the performance parameter 'P', the expectation that it has to be digitally tuned after every 231 hours (bold box in Table 4.2) no longer holds for the subsequent digital tunings: it is further digitally tuned after 240, 206, 223, 242, 232, 226, and 222 hours respectively. Similarly, the first complete replacement of a sub-IP (i.e. $SIP_{1A}$ with $SIP_{1B}$) has been carried out after 1829 hours (bold underlined box in Table 4.2), while the second complete replacement (i.e. $SIP_{1B}$ with $SIP_{1C}$) has been carried out a further 7571 hours later ($9400 - 1829 = 7571$). This is quite a large value compared to the first replacement time. It becomes extremely complex to decide on the right time of repair or replacement based on the initial reliability calculations carried out at the design stage, especially in case the performance parameter 'P' has a random degradation behaviour. Some of these instances are also highlighted by bold dotted boxes in Table 4.2.

The above discussion necessitates the usage of the proposed workflow of Figure 4.14 for regularly estimating the reliability during the operational life and enhancing system dependability by taking proper actions at the proper time.
It is also clear from these simulations that by incorporating the proposed strategy the system can be better managed in real time. That is, deciding at what time the decision making, tuning and replacement circuitry has to digitally tune or replace the sub-IPs for increasing its reliability. By anticipating in advance and taking the right decisions at the right time, the repairing time will be reduced and the maintainability of the system can be enhanced. Theoretically, the repair time (MTTR) can be reduced to zero. Therefore, by reducing the repair time to zero the availability, according to equation (4.11), can be increased to 100%. It means, by having proper maintainability with reduced repair time and hence increased reliability and availability, the dependability of the system will increase.

Table 4.2: Numerical results of simulations conducted in Figure 4.17. The start value of 'P' for every digital tuning option (i.e. 001-111) lies within 2% of its initial value (P(000)). Similarly, the initial value of 'P(000)' for each redundant sub-IP (SIP<sub>1(A,B,C)</sub>) lies within the designed specification bounds (i.e. [90 110]). All of these values are randomly selected to show the possible initial value variations due to fabrication-related process variations. Decision times are measured from the beginning of the simulation.

Start Value of 'P' per digital tuning option (identical for all four degradation behaviours):

| Sub-IP | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|------------------|---------|---------|---------|---------|---------|---------|---------|---------|
| SB1<sub>A</sub> | 100.631 | 100.474 | 101.054 | 100.746 | 100.435 | 100.593 | 100.692 | 100.770 |
| SB1<sub>B</sub> | 95.736 | 94.774 | 94.540 | 95.090 | 94.828 | 94.922 | 94.774 | 94.715 |
| SB1<sub>C</sub> | 98.693 | 98.798 | 98.735 | 98.379 | 98.076 | 98.545 | 98.469 | 98.435 |

For each of the four degradation behaviours (Logarithmic, Proportional Logarithmic, Random Logarithmic, and Constant Linear Degradation), Table 4.2 additionally lists the Start Time [hr], Decision Time [hr], and TBF [hr] of every digital tuning option of SB1<sub>A</sub>, SB1<sub>B</sub>, and SB1<sub>C</sub>.
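The tune-and-replace procedure behind these simulations can be sketched in a few lines of Python. This is only a minimal sketch and not the LabVIEW simulator itself: the drift laws, the constants `LINEAR_RATE` and `LOG_K`, the range from which the initial values are drawn, and all function names are assumptions chosen for illustration. Only the boundaries, the number of redundant sub-IPs, the number of tuning options and the 2% tuning range are taken from Table 4.1. 'P' is assumed to drift upwards towards $P_{max}$.

```python
import math
import random

P_MIN, P_MAX = 90.0, 110.0  # allowed boundaries of 'P' in a.u. (Table 4.1)
N_REDUNDANT = 3             # redundant sub-IPs (Table 4.1)
N_TUNING = 8                # digital tuning options per sub-IP (Table 4.1)
TUNING_RANGE = 0.02         # tuned start value lies within 2% of the initial value

# Hypothetical drift constants; they are not taken from the LabVIEW simulator.
LINEAR_RATE = 0.005         # a.u. per hour
LOG_K = 3.0                 # a.u. per unit of ln(1 + t)


def hours_to_boundary(behaviour, p_start, rng):
    """Hours until 'P', drifting upwards from p_start, reaches P_MAX."""
    headroom = P_MAX - p_start
    if behaviour == "linear":            # P(t) = p_start + LINEAR_RATE * t
        return headroom / LINEAR_RATE
    if behaviour == "log":               # P(t) = p_start + LOG_K * ln(1 + t)
        k = LOG_K
    elif behaviour == "proportional_log":
        # slow drift for start values near P_MIN, fast drift near P_MAX
        k = LOG_K * (0.5 + (p_start - P_MIN) / (P_MAX - P_MIN))
    elif behaviour == "random_log":
        k = LOG_K * rng.uniform(0.5, 2.0)
    else:
        raise ValueError(f"unknown behaviour: {behaviour}")
    return math.expm1(headroom / k)      # solves k * ln(1 + t) = headroom


def simulate(behaviour, seed=0):
    """Return ([(start value, TBF in hours), ...], total lifetime in hours)."""
    value_rng = random.Random(seed)      # start values, shared by all behaviours
    rate_rng = random.Random(seed + 1)   # used for random drift rates only
    phases = []
    for _ in range(N_REDUNDANT):
        # Assumed range, kept away from the boundaries so that every
        # tuned start value still lies inside [P_MIN, P_MAX].
        p_initial = value_rng.uniform(92.0, 107.0)
        for option in range(N_TUNING):
            p_start = p_initial
            if option > 0:               # options 001-111: re-tuned start value
                p_start *= 1.0 + value_rng.uniform(-TUNING_RANGE, TUNING_RANGE)
            phases.append((p_start, hours_to_boundary(behaviour, p_start, rate_rng)))
    lifetime = sum(tbf for _, tbf in phases)
    return phases, lifetime


for behaviour in ("log", "proportional_log", "random_log", "linear"):
    phases, lifetime = simulate(behaviour, seed=1)
    print(f"{behaviour:>17}: lifetime = {lifetime:,.0f} hr")
```

Using a separate random generator for the start values keeps them identical across the four behaviours for a given seed, which mirrors the way the same start values of 'P' are reused in Table 4.2.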
## 4.8.6 Possible Overhead and Overall Performance

In practice, the resolution or monitoring accuracy of the performance monitoring circuits, as well as their aging behaviour and the corresponding digital tuning accuracy, will play an important role in the overall effectiveness of the presented technique. Furthermore, the performance monitoring circuits, the switches, and the redundant sub-IPs will impose a serious area overhead (depending on the number of redundant sub-IPs, performance monitoring circuit(s) and switches), while on the other hand they are essential for a more dependable design. Similarly, the reliability of the digital circuits for decision making and digital repair may introduce some delay or cause slower processing during repair. However, the overall impact can be ignored by taking repair actions well in advance, before the system goes beyond its design specifications. Similarly, the performance monitoring circuits could impose circuit overloading effects that can potentially be solved by using an indirect technique as discussed in the following sections.

# 4.9 INDIRECT RELIABILITY ESTIMATION

Simulation-based techniques at the design stage have been frequently used to estimate the circuit-level reliability of digital circuits as well as analog circuits. Circuit-level reliability degradations are a complex function of different parameters including bias points, stress or temporal voltages and temperature, and process-related statistical static and temporal variations. For a long time, different analytical approaches have been thoroughly investigated in literature to examine the circuit-level aging effects based on the device-level models [Mar11, Pau05, Kum06, Wan07a, Wan07b, Zha08]. These analytical approaches also include some indirect ways to estimate the circuit-level reliabilities.
For example, the maximum digital circuit delay degradation [Wan08] due to NBTI closely follows the same power-law dependency as a function of stress time as the transistor threshold voltage ($V_{th}$) degradation. The maximum frequency ($F_{max}$) degradation of ring oscillators [Lee03] has also been investigated as a potential reliability estimator at the design stage. However, these estimators have limitations, since they do not relate to the netlist of the actual target circuit. Furthermore, the total quiescent supply current ($I_{\rm DDQ}$) [Kan07a], conventionally used for testing potential faults in digital as well as analog and mixed-signal circuits after fabrication [Eck93, Zja05], has also been used to indicate the reliability hazards of digital circuits. This is based on the temporal degradation of $I_{\rm DDQ}$, since the percentage degradation in $I_{\rm DDQ}$ closely follows the same power-law dependency with respect to stress time as the threshold voltage ($V_{th}$) degradation [Kan07a, Kan07b]. Similarly, a reliability-analysis technique using lifetime yield prediction of analog circuits has been proposed in [Pan10]. It also has limitations, since it assumes linear degradation and ignores other effects.

These analytical approaches are only available to investigate the circuit-level reliability effects based on the device-level models of electronic systems at the design stage. However, reliability estimation during the operational life of an electronic system still lacks a solution, especially for analog and mixed-signal systems. The methodology presented in the previous sections can be considered as a potential way of estimating the runtime reliability of electronic systems. However, it poses potential overloading problems while directly interacting with the critical (internal) nodes of the system [Chu12]. Therefore, an indirect approach is favoured for such reliability estimations.
The following sections will present a novel technique for indirectly estimating the reliability during the operational life of an electronic system. The presented technique is based upon the reliability simulations conducted during the design stage. Reliability simulations for critical performance parameter(s), sensitive to aging effects, over a range of input-stress voltages and working-stress temperatures have been used to generate a set of degradation values per unit time. This means that for each input-stress voltage and working-stress temperature, the degradation value per unit time of the critical performance parameter(s) will be calculated. These degradation values, along with their input-stress voltages and working-stress temperatures, are stored in a database (memory; an example is discussed in section 4.9.2.1). Therefore, by knowing the input-stress voltage and the working-stress temperature, the corresponding degradation in the critical performance parameter(s) can be retrieved from the stored values in the database. These degradation values per unit time are further used to estimate the degradation in the potential critical performance parameter, and hence provide the means to estimate the system reliability as discussed in the following sections.

## 4.9.1 DESIGN-STAGE DEGRADATION RATE EXTRACTION

In order to extract the design-stage degradation rate for a critical performance parameter, one first has to select the important degradation mechanisms responsible for system performance degradations. Secondly, the critical performance parameter with respect to aging effects has to be selected as discussed in section 4.5. To simplify the rate extraction procedure, we will only consider a single performance parameter 'P', being the most sensitive to aging effects. NBTI is considered to be the most dominant reliability degradation mechanism in aging-sensitive technology nodes.
The NBTI degradation, which affects the threshold voltage of a transistor, typically follows a power law of stress time and can be represented as a function of the electric field $E_{ox}$ in the MOS gate dielectric and the operating temperature T [Gie11]:

$$\Delta V_{th} = \left[ e^{\alpha_3 E_{ox}} \cdot e^{\frac{-E_a}{kT_{STRESS}}} \right] * t^{n_{NBTI}} \tag{4.12}$$

The above expression shows that the degradation in the threshold voltage of a transistor as a result of the NBTI effect is a function of the electric field $E_{ox}$ and the temperature $T_{STRESS}$. The electric field $E_{ox}$ (= $(V_{GS} - V_{th})/t_{ox}$) is proportional to the applied input-stress voltage $V_{STRESS}$ [Vat06]. Here $V_{GS}$ denotes the applied gate voltage and $t_{ox}$ represents the gate oxide thickness. Hence one can state:

$$\Delta V_{th} = f(V_{STRESS}, T_{STRESS}, t) \tag{4.13}$$

This means that the change in the threshold voltage $V_{th}$ as a result of the NBTI effect will be a function of the input-stress voltage $V_{STRESS}$ and the working-stress temperature $T_{STRESS}$ over the stress time $t$. It has already been stated in sections 4.2 and 4.3 that the system-level performance parameters are linked to device-level performance parameters. In this way, a change in the threshold voltage $V_{th}$, the device-level parameter, will result in a change in the system-level performance parameters. In other words, the degradation in the system-level critical performance parameter 'P', sensitive to NBTI effects and hence to the threshold voltage $V_{th}$, will also be a function of the input-stress voltage $V_{STRESS}$ and the working-stress temperature $T_{STRESS}$ over the stress time $t$:

$$\Delta P_{NBTI} = f(V_{STRESS}, T_{STRESS}, t) \tag{4.14}$$

This expression shows that if the input-stress voltage '$V_{STRESS}$' and the working-stress temperature '$T_{STRESS}$' change from one stress time interval (e.g. '$t_0 \rightarrow t_1$') to another stress time interval (e.g. '$t_1 \rightarrow t_2$'), then the corresponding change or degradation in the system-level performance parameter 'P' will also differ from one stress time interval to another. This degradation in the performance parameter 'P' further depends on the amount of time (e.g. $|t_1-t_0|$ and $|t_2-t_1|$) for which these stresses are applied. Therefore, it is important to know the degradation per unit time, or the degradation rate, for a particular input-stress voltage '$V_{STRESS}$' and working-stress temperature '$T_{STRESS}$'. This can be expressed as:

$$\frac{d}{dt}(\Delta P_{NBTI}) = \frac{d}{dt}(f(V_{STRESS}, T_{STRESS}, t)) \tag{4.15}$$

This degradation rate can be extracted from simulations at the design stage. One simple approach is to stress the electronic system continuously at each individual combination of input-stress voltage '$V_{STRESS}$' and working-stress temperature '$T_{STRESS}$' over a fixed stress time 't', for example twenty years. By doing this one obtains two values of the critical performance parameter 'P'. One value represents the fresh value '$P_0$' at $t = t_0$, without stress, and the other value '$P_t$' represents the degraded value due to stress over the stress time 't' (e.g. 20 years), for each individual combination of input stress values $V_{STRESS}$ and $T_{STRESS}$. This can be expressed as:

$$\begin{aligned} P_{0}(i,j) &= f(V_{STRESS}(i), T_{STRESS}(j), t_{0}) \\ P_{t}(i,j) &= f(V_{STRESS}(i), T_{STRESS}(j), t) \\ \Delta P_{NBTI}(i,j) &= P_{t}(i,j) - P_{0}(i,j) \end{aligned} \tag{4.16}$$

where i and j span the possible input-stress voltage and working-stress temperature values, with '$t_0$' the start time and 't' the operational time (e.g. 20 years) for which the stress has been applied.
The average degradation in performance parameter 'P' per unit time for each individual stress value $V_{STRESS}(i)$ and $T_{STRESS}(j)$ over the stress time interval '$t-t_0$' due to the NBTI effect can be extracted as:

$$\frac{\Delta P_{NBTI}(i,j)}{t - t_0} = \frac{P_t(i,j) - P_0(i,j)}{t - t_0} \tag{4.17}$$

For example, in case the output offset of an amplifier degrades by 10 mV in 20 years (175,200 hr) for a specific input-stress voltage and working-stress temperature, then the output offset degradation per unit time will be 57.08 nV/hr, or about 15.85 pV/s. This approach assumes that there is a linear degradation behaviour of $P_{NBTI}$ over the stress time $t$ (e.g. 20 years). In the case of a non-linear degradation behaviour, the electronic system can be stressed continuously at each individual input-stress voltage $V_{STRESS}(i)$ and working-stress temperature $T_{STRESS}(j)$ over a predefined interval of time, by assuming that the degradation remains linear over this interval of time. For example, the design-stage degradation can be calculated for every hour, month or year over the total time of twenty years. Although this will take a lot of effort, it will provide a more precise, possibly non-linear, degradation behaviour rather than the simple linear degradation discussed previously.

Each of the above degradation rates corresponds to an individual input-stress voltage $V_{STRESS}(i)$ and working-stress temperature $T_{STRESS}(j)$. Therefore, by having these values one can easily determine which degradation rate will be applicable to the performance parameter 'P' over a stress time $t_s$ if the corresponding input-stress voltage $V_{STRESS}(i)$ and working-stress temperature $T_{STRESS}(j)$ are known.
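The rate extraction of equations (4.16) and (4.17) can be illustrated with a short Python sketch. The function names and the dictionary layout of the rate table are hypothetical; the worked numbers are those of the 10 mV in 20 years example, with a year taken as 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hr, ignoring leap years


def degradation_rate(p_fresh, p_aged, stress_hours):
    """Average change of 'P' per hour over the stress time, as in equation (4.17)."""
    if stress_hours <= 0:
        raise ValueError("stress time must be positive")
    return (p_aged - p_fresh) / stress_hours


def build_rate_table(p_fresh, p_aged, stress_hours):
    """Map every (V_STRESS, T_STRESS) pair to its degradation rate.

    p_fresh and p_aged are dicts keyed by (v_stress, t_stress) that hold the
    simulated fresh value P_0(i, j) and the aged value P_t(i, j).
    """
    return {
        key: degradation_rate(p_fresh[key], p_aged[key], stress_hours)
        for key in p_fresh
    }


# Worked example from the text: 10 mV of offset drift in 20 years.
rate_v_per_hr = degradation_rate(0.0, 10e-3, 20 * HOURS_PER_YEAR)
print(f"{rate_v_per_hr * 1e9:.2f} nV/hr")  # 57.08 nV/hr
print(f"{rate_v_per_hr / 3600 * 1e12:.1f} pV/s")
```

For the piecewise-linear refinement mentioned above, the same function could be applied per sub-interval (for example per year), giving one rate table per sub-interval instead of a single one.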
For example, the degraded value of the performance parameter 'P' due to the NBTI effect for the stress time $t_s$ can be calculated as:

$$P_{NBTI}(t_s) = P_{NBTI}(t_0) + \left[ \frac{\Delta P_{NBTI}(i,j)}{t - t_0} \right] * t_s \tag{4.18}$$

where $P_{NBTI}(t_0)$ is the initial value of the performance parameter 'P' at time '$t_0$' and $[\Delta P_{NBTI}(i,j)/(t-t_0)]*t_s$ is the change in 'P' due to the NBTI effect over a stress time '$t_s$' for the input-stress voltage '$V_{STRESS}(i)$' and working-stress temperature '$T_{STRESS}(j)$'. The value '$[\Delta P_{NBTI}(i,j)/(t-t_0)]*t_s$' will be positive, and hence increases the initial value $P_{NBTI}(t_0)$, in case the performance parameter 'P' has increased over the stress time '$t_s$'. On the other hand, if the performance parameter 'P' has decreased over the stress time '$t_s$', then the value of '$[\Delta P_{NBTI}(i,j)/(t-t_0)]*t_s$' will be negative and hence reduces the initial value $P_{NBTI}(t_0)$. Furthermore, if the input-stress conditions are varying over the stress time '$t_s$', then the total stress time '$t_s$' can be divided into 'n' time intervals by the time points ($t_0, t_1, t_2, \dots, t_{n-1}, t_n$), where during each time interval ($t_1 - t_0, t_2 - t_1, \dots, t_n - t_{n-1}$) the input-stress voltage '$V_{STRESS}(i)$' and the working-stress temperature '$T_{STRESS}(j)$' are constant.
In this case equation (4.18) can be rewritten as:

$$\begin{aligned} P_{NBTI}(t_1) &= P_{NBTI}(t_0) + \left[ \frac{\Delta P_{NBTI}(i_1,j_1)}{t - t_0} \right] * (t_1 - t_0) \\ P_{NBTI}(t_2) &= P_{NBTI}(t_1) + \left[ \frac{\Delta P_{NBTI}(i_2,j_2)}{t - t_0} \right] * (t_2 - t_1) \\ &\;\;\vdots \\ P_{NBTI}(t_n) &= P_{NBTI}(t_{n-1}) + \left[ \frac{\Delta P_{NBTI}(i_n,j_n)}{t - t_0} \right] * (t_n - t_{n-1}) \end{aligned} \tag{4.19}$$

where $(i_k, j_k)$ denotes the input-stress voltage and working-stress temperature that apply during the k-th time interval $t_k - t_{k-1}$. The result of the above set of equations is the total change in the performance parameter 'P' due to the NBTI effect for the stress time '$t_s$', being the cumulative sum of each individual change during each time interval, with their individual stress conditions, added to its initial value. This provides the basis of our proposed approach, where design-stage simulations including the aging effects (NBTI etc.) for a potential critical performance parameter 'P' at each individual input-stress voltage $V_{STRESS}(i)$ and working-stress temperature $T_{STRESS}(j)$ are used to acquire a set of degradation rate values. This set of values can then be used at system level for indirectly estimating the degradation of the performance parameter 'P' by knowing the corresponding input-stress voltage $V_{STRESS}(i)$ and working-stress temperature $T_{STRESS}(j)$ values, as discussed in the next section.

## 4.9.2 Indirect Reliability Estimation Approach

As discussed in sections 4.3 and 4.4, the reliability of an electronic system can be related to its system-level performance parameters. Any change from the designed specifications (boundaries) of its system-level performance parameters will provide an estimate of its reliability. Therefore, measuring degradations in the system-level performance parameters directly or indirectly during the operational life will provide the basis for an estimate of the aging effects and hence a technique to estimate the reliability of the system.
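Indirectly estimating the degradation in this way amounts to accumulating, interval by interval, the stored design-stage rate of the stress condition that was in force. A minimal Python sketch of this bookkeeping is given below. The function name, the layout of the rate table and all numeric values are hypothetical; each contribution is taken as the rate multiplied by the length of the interval, which is one reading of equation (4.19), and the monitored stress conditions are assumed to be quantised to the grid points stored in the table.

```python
def estimate_parameter(p_initial, rate_table, stress_log):
    """Accumulate the degradation of 'P' over a sequence of stress intervals.

    rate_table maps (v_stress, t_stress) to a degradation rate per hour.
    stress_log is an iterable of (v_stress, t_stress, duration_hours), one
    entry per interval with constant stress conditions.
    """
    p = p_initial
    for v_stress, t_stress, hours in stress_log:
        if hours < 0:
            raise ValueError("interval duration must be non-negative")
        p += rate_table[(v_stress, t_stress)] * hours
    return p


# Hypothetical rates in mV per hour, keyed by (V_STRESS in V, T_STRESS in degC).
rate_table = {(1.0, 25): 1e-5, (1.2, 25): 2e-5, (1.2, 125): 8e-5}

# Hypothetical operating history: three intervals with constant conditions.
stress_log = [(1.0, 25, 1000), (1.2, 125, 500), (1.2, 25, 2000)]

p_now = estimate_parameter(0.0, rate_table, stress_log)
print(f"estimated change of 'P': {p_now:.3f} mV")
```

A stress condition that is missing from the table raises a `KeyError` here; a real implementation would need some interpolation or nearest-neighbour policy, which is left open in this sketch.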
Indirectly estimating the degradation in system-level performance parameters and calculating the reliability as discussed in the previous sections provides the basic methodology for indirectly estimating the reliability of systems during their operational life. The selection of the system-level performance parameters is application dependent, and the parameters can be divided into different categories based on their sensitivity to aging effects as discussed in section 4.5. The critical system-level performance parameter, for example 'P', that is most sensitive to aging effects for a particular application can be selected as the best indicator for reliability estimations via aging simulations at the design stage. Similarly, as discussed in section 4.6, the time required for the performance parameter 'P' to move beyond its designed specifications (i.e. $P_{\min}$ and $P_{\max}$ ), defined as the functional failure or the instantaneous-time-to-failure (iTTF), can be regarded as a *quantitative* means of estimating the reliability at a particular time. The larger the remaining lifetime before failure, the higher the reliability, and vice versa. Let the performance parameter 'P' change from $P_{NBTI}(t_0)$ to $P_{NBTI}(t_s)$ during the stress time ' $t_s$ '. Then, as discussed in section 4.6 (equations (4.1) and (4.2)), the time required for 'P' to move beyond its specifications, called its reliability $R(t_s) \cong iTTF(t_s)$ at time ' $t_s$ ', will be given by:

$$R(t_s) \cong iTTF(t_s) = \left[\frac{P_{max} - P_{NBTI}(t_s)}{P_{NBTI}(t_s) - P_{NBTI}(t_0)}\right] * t_s$$ (4.20)

in case 'P' has increased during the time interval ' $t_s - t_0$ ', provided that $P_{NBTI}(t_s) \neq P_{NBTI}(t_0)$ .
In case 'P' has decreased during the time interval ' $t_s - t_0$ ', provided that $P_{NBTI}(t_s) \neq P_{NBTI}(t_0)$ , then it will be given by: $$R(t_s) \cong iTTF(t_s) = \left[ \frac{P_{NBTI}(t_s) - P_{min}}{P_{NBTI}(t_s) - P_{NBTI}(t_0)} \right] * t_s$$ $$(4.21)$$ The next section will discuss the unity feedback amplifier as an example target system. The design-stage aging simulations for a potential critical performance parameter, sensitive to aging effects (here only the NBTI effect is considered), have been used to extract a set of values. These values are used with our indirect approach at system level to estimate the performance parameter degradation and hence the reliability of the target system. #### 4.9.2.1 CALCULATIONS FOR AN EXAMPLE TARGET SYSTEM In order to explain the presented idea of indirectly estimating the reliability of a system, a unity feedback amplifier has been considered as an example target system. Unity feedback amplifiers, commonly known as buffer amplifiers or simply buffers, are frequently used in integrated circuit design for electrical impedance transformation from one circuit to another circuit. The output offset voltage $V_{OS}$ of these buffer amplifiers is sensitive to different degradation mechanisms including the NBTI effect [Far13]. Therefore, while selecting buffer amplifiers for a particular application, the offset voltage $(V_{OS})$ at the output is considered to be an important performance parameter. The output offset voltage 'Vos', being sensitive to different aging mechanisms, must be monitored and calibrated/corrected to its nominal value during its operational life. Therefore, a unity feedback amplifier has been designed in 65nm technology and the corresponding output offset has been assumed to be the critical performance parameter as shown in Figure 4.18. 
In order to extract the offset degradation values, circuit-level and NBTI-related aging simulations have been conducted at the design stage using Cadence Virtuoso and the RelXpert environment respectively. The NBTI-related degradations are based on the AgeMOS SPICE models of the 65nm TSMC PDK library. Figure 4.18 shows the block diagram of the target unity feedback amplifier. The output offset voltage of this unity feedback amplifier has been extracted at two different points in time.

Figure 4.18: An example system for measuring the output offset voltage $(V_{OS})$ of a unity feedback amplifier.

Figure 4.19: Output offset voltage ( $V_{OS}$ ) of the unity feedback system over a range of input-stress voltages $(0.0V \le V_{PP} \le 1.0V)$ and working-stress temperatures $(0.0^{\circ}C \le T_{STRESS} \le 125.0^{\circ}C)$ at time 't = 0'.

1. First, at time ' $t_0$ ' (t = 0) without any degradations. At this point in time, the range of input-stress voltage ' $V_{PP}$ ' and working-stress temperature ' $T_{STRESS}$ ' that could occur in the working environment of an application system has been taken as (0.0 - 1.0)V and (0.0 - 125.0)°C respectively. Figure 4.19 shows the corresponding output offset voltage ( $V_{OS}$ ) of the unity feedback amplifier as a function of the input-stress voltage $V_{PP}$ (0.0 - 1.0)V and the working-stress temperature $T_{STRESS}$ (0.0 - 125.0)°C.
2. Secondly, after twenty years of continuous NBTI effect for each input-stress voltage $V_{PP}$ (0.0 - 1.0)V and working-stress temperature $T_{STRESS}$ (0.0 - 125.0)°C.

The above values of the output offset voltage, before the NBTI effect at time ' $t_0$ ' and after the continuous NBTI effect of twenty years over the possible range of stressors, are further used to extract the change or degradation in the output offset voltage for each individual combination of stressors, as shown in Figure 4.20.
Figure 4.20: Change in the output offset voltage ( $V_{OS}$ ) due to the NBTI effect of the unity feedback system which is continuously stressed for twenty years over a range of input-stress voltages ( $0.0V \le V_{PP} \le 1.0V$ ) and working-stress temperatures ( $0.0^{\circ}\text{C} \le T_{STRESS} \le 125.0^{\circ}\text{C}$ ).

This figure shows that the total change or degradation in the output offset voltage due to the NBTI effect, stressed for twenty years at each individual input-stress voltage $V_{PP}$ (0.0 – 1.0)V and working-stress temperature $T_{STRESS}$ (0.0 – 125.0)°C, is not unidirectional. For small values of the input-stress voltage $V_{PP}$ , the output offset voltage $V_{OS}$ increases as the temperature $T_{STRESS}$ increases, i.e. the degradation is in the positive direction. For higher values of the input-stress voltage, on the other hand, the output offset voltage decreases as the temperature increases, i.e. the degradation is in the negative direction. This contradicts the usual notion that the degradation due to aging mechanisms (NBTI etc.) is unidirectional, i.e. either increasing or decreasing. Here it is clear that the output offset voltage could increase or decrease over the stress time depending on the input-stress conditions. This also highlights the importance of monitoring reliability during the operational life of a system, despite the usual practice of estimating reliability using design-stage simulations. Figure 4.20 provides an important set of values that can be used at system level for indirectly estimating the degradation in the output offset voltage. These values give the relationship between the stressors and the corresponding degradation over the stress time.
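How such a set of values might be queried at system level can be sketched as a small lookup table with bilinear interpolation between grid points. This is only an illustration under stated assumptions: the grid is far coarser than a real design-stage sweep, the names `V_GRID`, `T_GRID`, `DELTA_20Y` and `rate_per_hour` are hypothetical, and interpolation is one possible design choice that the text does not prescribe. The twenty-year totals of +1.2 and -2.1 at the two 125.0°C corners are the magnitudes quoted in section 4.9.2.3; every other table entry is an invented placeholder.

```python
from bisect import bisect_right

HOURS_20Y = 20 * 365 * 24        # 175200 h of continuous stress

V_GRID = [0.0, 0.5, 1.0]         # input-stress voltage V_PP grid (V)
T_GRID = [0.0, 125.0]            # working-stress temperature grid (deg C)
# DELTA_20Y[i][j]: twenty-year change of V_OS at V_GRID[i], T_GRID[j].
# Only the two 125 deg C corner values come from the text; the rest is made up.
DELTA_20Y = [
    [0.4, 1.2],
    [-0.1, -0.3],
    [-0.7, -2.1],
]


def _bracket(grid, x):
    """Return (lower index, fraction) locating x in grid, clamped to its range."""
    if x <= grid[0]:
        return 0, 0.0
    if x >= grid[-1]:
        return len(grid) - 2, 1.0
    k = bisect_right(grid, x) - 1
    return k, (x - grid[k]) / (grid[k + 1] - grid[k])


def rate_per_hour(v_pp, t_stress):
    """Signed degradation per hour, bilinearly interpolated from the table."""
    i, fv = _bracket(V_GRID, v_pp)
    j, ft = _bracket(T_GRID, t_stress)
    low = DELTA_20Y[i][j] * (1 - ft) + DELTA_20Y[i][j + 1] * ft
    high = DELTA_20Y[i + 1][j] * (1 - ft) + DELTA_20Y[i + 1][j + 1] * ft
    return (low * (1 - fv) + high * fv) / HOURS_20Y


# The sign of the rate flips between the two voltage extremes at 125 deg C.
print(rate_per_hour(0.0, 125.0) > 0, rate_per_hour(1.0, 125.0) < 0)
```

Dividing the stored twenty-year total by the number of hours assumes that the degradation is linear in time, which is the same assumption the per-hour update of equation (4.19) makes.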
Therefore, by knowing the input-stress voltage $V_{PP}(i)$ and working-stress temperature $T_{STRESS}(j)$ , the corresponding degradation in the output offset voltage $\Delta V_{OS-NBTI}(i,j)$ can be estimated indirectly by using equation (4.19):

$$V_{OS-NBTI}(t_n) = V_{OS-NBTI}(t_{n-1}) + \left[\frac{\Delta V_{OS-NBTI}(i,j)}{t - t_0}\right] * t_n$$ (4.22)

Therefore, the degraded value of the output offset voltage of the unity feedback amplifier at time ' $t_1$ ' will be the sum of the initial offset voltage value at time ' $t_0$ ' and the change in the offset voltage value due to the NBTI effect over the stress time interval ' $t_1-t_0$ ', assuming that the stress conditions during this time interval are constant. In case of varying stressors, this assumption can be made valid by choosing the stress time interval to be sufficiently short that the stress conditions remain constant over that time interval. Hence the total degradation over the stress time ' $t_s$ ' will be the cumulative sum over the individual stress time intervals, each with its own stress conditions. Furthermore, if ' $V_{OS-min}$ ' and ' $V_{OS-max}$ ' represent the design-stage functional specification boundaries for the output offset voltage ' $V_{OS}$ ', then according to equations (4.20) and (4.21) the reliability at any time point ' $t_n$ ' will be given by:

$$R(t_n) \cong iTTF(t_n) = \left[ \frac{V_{OS-max} - V_{OS-NBTI}(t_n)}{V_{OS-NBTI}(t_n) - V_{OS-NBTI}(t_{n-1})} \right] * t_n$$ (4.23)

in case the output offset voltage ' $V_{OS}$ ' has increased during the time interval ' $t_n - t_{n-1}$ ', provided that $V_{OS-NBTI}(t_n) \neq V_{OS-NBTI}(t_{n-1})$ .
In case the output offset voltage ' $V_{OS}$ ' has decreased during the time interval ' $t_n - t_{n-1}$ ', provided that $V_{OS-NBTI}(t_n) \neq V_{OS-NBTI}(t_{n-1})$ , then it will be given by:

$$R(t_n) \cong iTTF(t_n) = \left[ \frac{V_{OS-NBTI}(t_n) - V_{OS-min}}{V_{OS-NBTI}(t_n) - V_{OS-NBTI}(t_{n-1})} \right] * t_n$$ (4.24)

This means that by having the initial value of the output offset voltage ' $V_{OS-NBTI}(t_0)$ ' at time ' $t_0$ ' and the values of the stressors (voltage and temperature) during the lifetime of the unity-feedback amplifier, the corresponding degradation in the output offset voltage can be estimated indirectly. This degradation in the output offset voltage can be further used to estimate the reliability of the unity-feedback amplifier as discussed in equations (4.23) and (4.24). The next section will present the simulation results of the same unity-feedback system where the simulated design-stage degradation values, as described in Figure 4.20, have been stored in a database (memory). The degradation in the output offset value due to the NBTI effect and the corresponding reliability of the system have been estimated indirectly by using equations (4.22) - (4.24).

#### 4.9.2.2 SIMULATION SETUP

A simulation setup has been constructed in the LabVIEW environment in order to investigate the proposed idea of indirectly estimating the reliability during the operational life of a system. The simulation setup is based upon the degraded values extracted from the design-stage simulations for a unity-feedback amplifier designed in the 65 nm technology node as shown in Figure 4.20. These design-stage degraded values of the output offset voltage $V_{OS-NBTI}$ as a result of the NBTI effect for each input-stress voltage and working-stress temperature over a continuous stress period of twenty years have been stored in the database.
These values are further used to extract the degradation per hour in the output offset voltage $V_{OS-NBTI}$ according to equation (4.17) and hence the reliability of the unity-feedback amplifier as discussed in equations (4.23) and (4.24). The size of the database depends on the linearity of the degradation behaviour: a linear behaviour requires a smaller database, whereas a non-linear behaviour requires a larger one. Furthermore, if the degradation per hour depends linearly on the stress conditions, a linear degradation equation can be stored along with its coefficients instead of storing values for each individual input-stress voltage $V_{PP}$ and working-stress temperature $T_{STRESS}$ . Storing ND data points in 32-bit format requires ND\*4/1024 = ND/256 Kbytes.

In this simulation setup, the input-stress voltage $V_{PP}$ can be generated randomly, starting from an initial value, within the provided minimum and maximum stress limits. Similarly, the working-stress temperature $T_{STRESS}$ can also be generated randomly, starting from an initial value, within the provided minimum and maximum stress limits, as shown in Figure 4.23. These randomly generated input-stress voltage and working-stress temperature values are used in this simulation setup in order to represent a real working environment. By knowing the constant/average input-stress voltage and working-stress temperature per hour (the one-hour interval serves as an illustration), the corresponding degradation value per hour can be obtained from the stored database. Having this degradation value per hour ' $\Delta V_{OS-NBTI}(i,j)/(t-t_0)$ ' and the initial starting value ' $V_{OS-NBTI}(t_0)$ ', the new value of the output offset voltage after one hour can be calculated as specified in equation (4.22).
That is, by simply adding or subtracting the degraded value per hour ' $\Delta V_{OS-NBTI}(i,j)/t-t_0$ ' from the initial starting value ' $V_{OS-NBTI}(t_0)$ ' the new value after one hour ' $V_{OS-NBTI}(t_1)$ ' can be calculated. This process of continuously subtracting or adding degraded values per hour will ultimately lead to the net degradation of the output offset voltage over the stress period of twenty years (175200 hr = 20\*365\*24). Similarly, by specifying $V_{OS-min}$ and $V_{OS-max}$ as the minimum and maximum design specifications of the output offset voltage ' $V_{OS}$ ' for a particular application the reliability $R(t) \cong iTTF(t)$ of the unity-feedback system can be calculated by using equations (4.23) and (4.24). #### 4.9.2.3 SIMULATION RESULTS Three different cases based upon the stressors have been considered to validate the presented idea of indirectly estimating the reliability of a unity-feedback system as discussed below: ### Case 1: Constant Stressors ( $V_{PP} = 0.0V$ and $T_{STRESS} = 125.0$ °C) Figure 4.21 shows the simulation results of the output offset voltage ' $V_{OS}$ ' of the unity-feedback system which is continuously stressed over twenty years with zero input-stress voltage ( $V_{\rm PP}=0.0V$ ) being one extreme of input-stress voltage and 125.0°C of working-stress temperature ( $T_{\rm STRESS}=125.0$ °C). This figure shows that the degradation in the output offset voltage ' $V_{OS}$ ' due to the NBTI effect is linearly increasing over the stress time of twenty years and the total degradation during this stress time is equal to 1.2 V. This is in accordance with the design-stage simulation results of Figure 4.20 which shows that the total degradation in the output offset voltage over twenty years with $V_{\rm PP}=0.0V$ and $T_{\rm STRESS}=125.0$ °C is 1.2 V. 
Similarly, if $V_{OS-min}=94.0~mV$ and $V_{OS-max}=98.0~mV$ represent the designed functional specification boundaries of the output offset voltage for a particular application, then the reliability ( $\cong iTTF(t)$ ) of the system has linearly decreased from 202939 hours to 27739 hours.

Figure 4.21: Simulation results of the output offset voltage $(V_{OS})$ of the unity feedback system which is continuously stressed for twenty years (175200 hr) with 0.0V input-stress voltage $(V_{PP} = 0.0V)$ , being one extreme of the input-stress voltage, and 125.0°C working-stress temperature $(T_{STRESS} = 125.0$ °C). Output offset voltage $(V_{OS})$ and reliability $(\cong iTTF(t))$ are linearly increasing and decreasing respectively over the stress time due to the NBTI effect.

Figure 4.22: Simulation results of the output offset voltage $(V_{OS})$ of the unity feedback system which is continuously stressed for twenty years (175200 hr) with 1.0V input-stress voltage $(V_{PP} = 1.0V)$ , being the other extreme of the input-stress voltage, and 125.0°C working-stress temperature $(T_{STRESS} = 125.0$ °C). Output offset voltage $(V_{OS})$ and reliability $(\cong iTTF(t))$ are both linearly decreasing over the stress time due to the NBTI effect.

#### Case 2: Constant Stressors ( $V_{PP} = 1.0V$ and $T_{STRESS} = 125.0$ °C)

Similarly, Figure 4.22 represents the simulation results of the output offset voltage $V_{OS}$ of the unity feedback system which is continuously stressed over twenty years with one volt input-stress voltage $(V_{PP}=1.0V)$ , being the other extreme of the input-stress voltage, and 125.0°C working-stress temperature $(T_{STRESS}=125.0^{\circ}\text{C})$ . This time the output offset voltage $V_{OS}$ is linearly decreasing over the stress time of twenty years due to the NBTI effect, and the total degradation during this stress time is equal to 2.1V.
This is again in accordance with the design-stage simulation results of Figure 4.20, which shows that the total degradation in the output offset voltage over twenty years with $V_{PP}=1.0V$ and $T_{STRESS}=125.0^{\circ}\text{C}$ is 2.1V. Furthermore, if $V_{OS-min}=94.0~mV$ and $V_{OS-max}=98.0~mV$ represent the designed functional specification boundaries of the output offset voltage for a particular application, then the reliability ( $\cong iTTF(t)$ ) of the system has also linearly decreased from 190529 hours to 15329 hours.

### Case 3: Random Stressors ( $0.0V \le V_{PP} \le 0.25V$ and $70.0^{\circ}\text{C} \le T_{STRESS} \le 125.0^{\circ}\text{C}$ )

Finally, Figure 4.23 represents the simulation results of the output offset voltage of the unity feedback system which is randomly stressed over twenty years (this is purely illustrative; the stress conditions change randomly each hour) with varying input-stress voltages $(0.0V \le V_{PP} \le 0.25V)$ and varying working-stress temperatures $(70.0^{\circ}\text{C} \le T_{STRESS} \le 125.0^{\circ}\text{C})$ . This time, the degradation in the output offset voltage due to the NBTI effect is also random over the stress time of twenty years. Sometimes it is increasing and at other times it is decreasing, depending on the input-stress values (see Figure 4.20). Similarly, if $V_{OS-min} = 94.0 \text{ mV}$ and $V_{OS-max} = 98.0 \text{ mV}$ represent the designed functional specification boundaries of the output offset voltage for a particular application, then the reliability ( $\cong iTTF(t)$ ) of the system is also fluctuating randomly between 45948E+6 hours and 200921 hours. The randomly selected values serve only to validate the proposed approach and will of course differ from real-world stress-condition profiles.
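The hourly procedure behind the three cases can be sketched in Python as follows. This is a hedged illustration, not the LabVIEW setup itself. It assumes that the time multiplier in equations (4.23) and (4.24) is the length of one step and that the magnitude of the drift is used, so that the estimate is positive in both directions. The function names, the rate model `toy_rate` and all numerical values are hypothetical.

```python
import random

HOURS_20Y = 20 * 365 * 24  # 175200 h, the stress period used in the text


def ittf_hours(v_now, v_prev, step_h, v_min, v_max):
    """iTTF after one step, following equations (4.23)/(4.24).

    Assumptions: the multiplier is the step length, and the magnitude of
    the drift is used so the result is positive for either direction.
    """
    delta = v_now - v_prev
    if delta == 0:
        return float("inf")  # the equations are undefined without drift
    margin = (v_max - v_now) if delta > 0 else (v_now - v_min)
    return margin / abs(delta) * step_h


def simulate(v0, rates_per_h, v_min, v_max, step_h=1.0):
    """Apply the update of equation (4.22) once per step.

    Returns the offset trace (including v0) and one iTTF value per step.
    """
    offsets, ittfs = [v0], []
    for rate in rates_per_h:
        v_next = offsets[-1] + rate * step_h
        ittfs.append(ittf_hours(v_next, offsets[-1], step_h, v_min, v_max))
        offsets.append(v_next)
    return offsets, ittfs


def toy_rate(v_pp, t_stress):
    """Hypothetical rate model (mV/h): positive at low V_PP, negative above."""
    return (0.1 - v_pp) * t_stress * 1e-6


# Constant stress (made-up numbers): four one-hour steps at +0.5 mV/h.
offsets, ittfs = simulate(94.5, [0.5] * 4, v_min=94.0, v_max=98.0)

# Random stress, redrawn every hour as in Case 3 (seeded for repeatability).
rng = random.Random(0)
random_rates = [toy_rate(rng.uniform(0.0, 0.25), rng.uniform(70.0, 125.0))
                for _ in range(24)]
rand_offsets, rand_ittfs = simulate(96.0, random_rates, 94.0, 98.0)
```

Under constant stress the estimate falls by exactly one hour per elapsed hour. The figures reported for Case 1 and Case 2 show the same behaviour: 202939 - 175200 = 27739 hours and 190529 - 175200 = 15329 hours. Under random stress the rate changes sign and magnitude from hour to hour, so the estimate jumps accordingly, which matches the fluctuation described for Case 3.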
These indirect reliability estimations during the operational life of a system can be further used to take the necessary repair actions in order to enhance the dependability of the whole system as discussed in section 4.7.2.

Figure 4.23: Simulation results of the output offset voltage $(V_{OS})$ of the unity feedback system which is randomly stressed for twenty years (175200 hr) with varying input-stress voltage $(0.0V \le V_{PP} \le 0.25V)$ and varying working-stress temperature $(70.0^{\circ}C \le T_{STRESS} \le 125.0^{\circ}C)$ . Output offset voltage $(V_{OS})$ and reliability $(\cong iTTF(t))$ are randomly increasing or decreasing over the stress time due to the NBTI effect.

#### 4.10 CONCLUSIONS

In this chapter, it was shown how regularly estimating the reliability during the operational life of a system can be used to enhance the system dependability. Variations and degradations in the system-level parameters, which are directly influenced by device-level parameters, have been used to estimate the reliability during the operational life of a system. These degradations further depend on the initial values of parameters due to fabrication-related process variations and the architecture of the system. This makes the reliability estimation a complicated process during operational life. By using conventional techniques of estimating the reliability of a system at the design stage, one cannot handle these real-time variations that are a function of initial values. Therefore, a workflow based upon regular monitoring and storage of system-level parameters is proposed for estimating the reliability of these systems during their operational life. These reliability estimations are further used for intelligently making decisions on digital tuning and replacement mechanisms. Based on derived formulas, an example target system has been simulated in a LabVIEW environment.
These simulations validate the proposed idea: by regularly monitoring the performance parameter(s) most sensitive to aging effects, whether their degradation behaviour is linear or non-linear, and intelligently taking the right decisions at the right time, the system dependability during its operational life can be better managed and extended. The price paid is in terms of area overheads for monitoring and storage facilities. However, this is essential for a more dependable design.

Furthermore, in order to avoid potential circuit overloading problems while directly interacting with the critical (internal) nodes of the system, an indirect technique has been proposed. This indirect technique estimates the reliability of a system based on the regular monitoring of input-stress voltages and working-stress temperatures. Design-stage aging simulations over a range of input-stress voltages and working-stress temperatures have been used to get a set of degradation values per unit time for a critical performance parameter that is sensitive to aging effects. These values are then used at system level to estimate the degradation of that particular critical performance parameter and hence the reliability of the system due to aging effects. The simulation results obtained in a LabVIEW environment for a simple unity-feedback system for constant and random input-stress conditions show that the proposed technique of estimating reliability during operational life is valid. This idea can be further extended to bigger and more complex systems by dividing the whole system into simpler subsystems and adopting the procedure explained in this chapter. In case the system cannot be divided into simpler subsystems any more, the proposed procedure can be adopted individually for each complex system. Furthermore, the effectiveness of the proposed technique depends on the quality of the aging simulations conducted at design stage.
Having close similarity between the design-time aging simulations and the real world aging behaviour will result close to 100% accuracy of the proposed technique. #### 4.11 REFERENCES [Ala07] M. Alam, K. Kang, B.C. Paul, and K. Roy, "Reliability- and Process-Variation Aware Design of VLSI Circuits," in IEEE Int. Symp. Physical and Failure Analysis of Integrated Circuits, pp. 17-25, 2007. [Avi01] A. Avizienis, J-C. Laprie, and B. Randell, "Fundamental concepts of dependability", in Laboratory for Analysis and Architecture of Systems (LAAS-CNRS) Technical Report no. 01-145, Apr. 2001. [Chi07] C. Chiang, and J. Kawa, "Design for Manufacturability and Yield for Nano-scale CMOS," Springer Publishing Press, ISBN 978-1-4020-5187-6, pp. 14–15, 2007. [Chu12] S. Chun, J. D.S. Spuentrup, and J.N. Burghartz, "A Novel Built-in Aging Detection Architecture for Mixed-Signal Integrated Circuits," in IEEE Int. Conf. Ph.D. Research in Microelectronics and Electronics (PRIME), pp.1-4, 2012. [Eck93] K.R. Eckersall, P.L. Wrighton, I.M. Bell, B.R. Bannister, and G.E. Taylor, "Testing mixed signal ASICs through the use of supply current monitoring," in Proceedings of European Test Conference, pp. 385-391, 1993. [Far13] F.A. Farag, "High performance CMOS buffer amplifier with offset cancellation," in IEEE Int. Saudi International Electronics, Communications and Photonics Conference (SIECPC), pp. 1-4, 2013. [Gao10] M. Gao, Z. Ye, Y. Peng, Y. Wang, and Z. Yu, "A comprehensive model for gate delay under process variation and different driving and loading conditions," in IEEE Int. Symp. on Quality Electronic Design (ISQED), pp. 406-412, 2010. [Gie11] G. Gielen, E. Maricau, and P. De Wit, "Analog circuit reliability in sub-32 nanometer CMOS: Analysis and mitigation," in IEEE Int. Conf. Design, Automation & Test in Europe (DATE), pp. 1-6, 2011. [Jha05] N.K. Jha, P.S. Reddy, D.K. Sharma, and V.R. Rao, "NBTI degradation and its impact for analog circuit reliability," in IEEE Trans. 
Electron Devices, Vol. 52, No. 12, pp. 2609-2615, 2005. [Kan07a] K. Kang, M.A. Alam, and K. Roy, "Estimation of NBTI Degradation using IDDQ Measurement," in IEEE Int. Reliability Physics Symposium, pp. 10-16, 2007. [Kan07b] K. Kang, K. Kim, A.E. Islam, M.A. Alam, K. Roy, "Characterization and Estimation of Circuit Reliability Degradation under NBTI using On-Line IDDQ Measurement," in IEEE Design Automation Conference (DAC), pp. 358-363, 2007. [Kha11] M.A. Khan, and H.G. Kerkhoff, "A system-level platform for dependability enhancement and its analysis for mixed-signal SoCs," in IEEE Int. Symp. on Design and Diagnostics of Electronic Circuits & Systems (DDECS), pp. 17-22, 2011. [Kha13a] M.A. Khan, and H.G. Kerkhoff, "An Indirect Technique for Estimating Reliability of Analog and Mixed-Signal Systems during Operational Life," in IEEE Int. Symp. Design and Diagnostics of Electronic Circuits & Systems (DDECS), pp. 159-164, 2013. [Kha13b] M.A. Khan, and H.G. Kerkhoff, "The Essence of Reliability Estimation during Operational Life for Achieving High System Dependability," in IEEE Euromicro Conference on Digital System Design (DSD), pp. 575-581, 2013. [Kri10] S.K. Krishnappa, H. Singh, and H. Mahmoodi, "Incorporating Effects of Process, Voltage, and Temperature Variation in BTI Model for Circuit Design," in IEEE Latin American Symposium on Circuits and Systems, pp. 236-239, 2010. [Kum06] S.V. Kumar, C.H. Kim, S.S. Sapatnekar, "An Analytical Model for Negative Bias Temperature Instability," in IEEE Int. Conf. on Computer-Aided Design (ICCAD), pp. 493-496, 2006. [Lat11] M.A.A. Latif, N.B.Z. Ali, and F.A. Hussin, "A case study of process-variation effect to SoC analog circuits," in IEEE Int. Conf. Recent Advances in Intelligent Computational Systems (RAICS), pp. 520-523, 2011. [Lee03] Y-H Lee, et al., "Effect of pMOST bias-temperature instability on circuit reliability performance," in IEEE Int. Electron Devices Meeting (IEDM), pp. 14.6.1-14.6.4, 2003. [Lin06] H.
Lin, and D.K. Chang, "A Low-Voltage Process Corner Insensitive Subthreshold CMOS Voltage Reference Circuit," in IEEE Int. Conf. on Integrated Circuit Design and Technology (ICICDT), pp. 1-4, 2006. [Lu09] Y. Lu, et al., "Statistical reliability analysis under process variation and aging effects," in IEEE Int. Design Automation Conference (DAC), pp. 514-519, 2009. [Mar09] E. Maricau, and G. Gielen, "Efficient reliability simulation of analog ICs including variability and time-varying stress," in IEEE Int. Conf. Design, Automation & Test in Europe (DATE), pp. 1238-1241, 2009. [Mar11] E. Maricau, et al., "A compact NBTI model for accurate analog integrated circuit reliability simulation," in Proceedings of the European Solid-State Device Research Conference (ESSDERC), pp. 147-150, 2011. [Pan10] X. Pan, and H. Graeb, "Reliability analysis of analog circuits by lifetime yield prediction using worst-case distance degradation rate," in IEEE Int. Symp. Quality Electronic Design (ISQED), pp. 861-865, 2010. [Pau05] B.C. Paul, K. Kang, H. Kufluoglu, M.A. Alam, and K. Roy, "Impact of NBTI on the temporal performance degradation of digital circuits," in IEEE Electron Device Letters, Vol. 26, No. 8, pp. 560- 562, 2005. [Sch03] D.K. Schroder and J.A. Babcock, "Negative bias temperature instability: Road to cross in deep submicron silicon semiconductor manufacturing," in Journal of Applied Physics, Vol. 94, No. 1, pp. 1–18, 2003. [Vat06] R. Vattikonda, W. Wang, and Y. Cao, "Modeling and minimization of PMOS NBTI effect for robust nanometer design," in IEEE Int. Design Automation Conference (DAC), pp. 1047-1052, 2006. [Wan07a] W. Wang, et al., "The Impact of NBTI on the Performance of Combinational and Sequential Circuits," in IEEE Design Automation Conference (DAC), pp. 364-369, 2007. [Wan07b] W. Wang, Z. Wei, S. Yang, and Y. Cao, "An efficient method to identify critical gates under circuit aging," in IEEE Int. Conf. on Computer-Aided Design (ICCAD), pp. 735-740, 2007. [Wan08] W. 
Wang, et al., "Statistical prediction of circuit aging under process variations," in IEEE Int. Custom Integrated Circuits Conference (CICC), pp. 13-16, 2008. [Ye10] Y. Ye, S. Gummalla, C.C. Wang, C. Chakrabarti, and Y. Cao, "Random variability modeling and its impact on scaled CMOS circuits," in Journal of Computational Electronics, Vol. 9, No. 3-4, pp. 108-113, 2010. [Zha08] B. Zhang, and M. Orshansky, "Modeling of NBTI-Induced PMOS Degradation under Arbitrary Dynamic Temperature Variation," in Int. Symp. on Quality Electronic Design (ISQED), pp. 774-779, 2008. [Zja05] A. Zjajo, and J.P. de Gyvez, "Evaluation of signature-based testing of RF/analog circuits," in IEEE Int. European Test Symposium (ETS), pp. 62-67, 2005. # DIFFERENTIATING BETWEEN SHORT-TERM AND LONG-TERM DEPENDABILITY ISSUES ABSTRACT — In the previous chapters it was shown that degradation in system-level performance parameters can be used to estimate the reliability which is crucial to improve the dependability during the operational life of a system. In this chapter, it will be shown that the operating temperature and supply voltage can lead to short-term and long-term dependability issues; therefore they must be addressed separately. System dependability consists of a set of attributes, of which the reliability heavily depends on operating temperature and supply voltage. Any change beyond the designed specifications of operating temperature and supply voltage may degrade the system performance and could result in system reliability issues and hence dependability problems. These reliability issues can be short-term variations and are solvable if the system returns back to its normal operational temperature and supply voltage. Therefore, this kind of short-term reliability issues should be differentiated from longterm reliability effects resulting from aging mechanisms. These long-term reliability issues are a function of stress time and have a cumulative nature. 
This differentiation turns out to be essential to better manage the system dependability during its operational life, as will be shown later on. In this way the two issues can be separated and addressed individually. This separation of the two reliability phenomena requires regular monitoring of the system operating temperature and the supply voltage during its operational life. This monitoring has been taken into account in our proposed hardware architecture and workflow, which tackles the two phenomena separately and carries out proper actions in order to enhance the system dependability. The simulation results for a target system carried out in the LabVIEW environment fully support the above idea of separating the issues and addressing them separately.

#### 5.1 Introduction

As mentioned in the previous chapters, the dependability of electronic systems is becoming an important design concern as the technology is shrinking towards nanometer limits. On the one hand, different physical degradation mechanisms like negative bias temperature instability (NBTI), positive bias temperature instability (PBTI), hot carrier injection (HCI), time-dependent-dielectric-breakdown (TDDB), and electromigration (EM) are becoming prominent [Lew09, Sch03, Jha05, Pau05]. On the other hand, fabrication-related process variations, supply-voltage, and operational-temperature (PVT) related variations are also affecting the performance of these electronic systems [Lu09, Lat11, Kri10, Uns06, Kum06, Ali06]. The performance degradation of these electronic circuits under statistical variations has been a popular research area for the last couple of years [Wan08]. The interesting observation here is that some of these performance-degrading effects have a common source of disturbance. For example, NBTI and PBTI are heavily dependent on the stress temperature and input-stress voltage, causing serious performance degradations in the electronic systems [Lew09, Sch03, Jha05].
Operational-temperature and supply-voltage variations also contribute individually to the performance degradation of these electronic systems. The main difference lies in the parameter "time". Temperature and voltage variations beyond specifications can cause serious performance degradations of an instantaneous, temporary or short-term nature, as will be explained in the next sections. These short-term performance degradations can diminish if the corresponding operational temperature and voltage return to their nominal values. Furthermore, depending on how long these variations remain in effect, they can also cause long-term aging effects (e.g. NBTI, PBTI). Those effects can contribute to a permanent performance degradation of the electronic system at a particular time. The reason is the cumulative nature of these degradation mechanisms over the operating time, which determines their contribution to degrading the performance of these electronic systems. Therefore, variations in the system reliability, and hence the system dependability, as a function of operational temperature and voltage will be of a short-term as well as a long-term nature. In order to improve the system dependability, these short-term and long-term effects should be addressed separately.

In this chapter, these two different kinds of effects and how they can influence the system performance during its operational life will be explained. Furthermore, it will be investigated how operational-temperature and supply-voltage variations can cause short-term (non-permanent) and long-term (possibly permanent) performance degradations. In addition, it will be examined how one can differentiate between these two performance-degradation scenarios and how this differentiation can help in achieving a high system dependability during its operational life. The remainder of this chapter is organized as follows.
Section 5.2 briefly describes the importance of supply-voltage (V) and operational-temperature (T) variations in affecting the performance of both digital and analog circuits. Sections 5.3 and 5.4 discuss, respectively, the importance of separating the NBTI (the only aging mechanism considered in this chapter) and VT variations, and the corresponding requirements for enhancing the system dependability. The basic idea of the resulting proposed hardware architecture (an extension of Figure 4.16), the corresponding workflow, its working principle and the respective consequences are presented in section 5.5. The simulation environment and simulation results are given in section 5.6. The conclusions and references are presented at the end of the chapter.

#### 5.2 SUPPLY-VOLTAGE AND TEMPERATURE VARIATIONS

With technology scaling, both $V_{DD}$ and $V_{th}$ reduce, which is critical for electronic designers, especially analog designers, as shown in Figure 5.1 [Bas09]. This figure shows that the distance $(V_{DD} - V_{th})$, known as the free-voltage space for analog design, is also reducing. Therefore, any variation in $V_{DD}$ or $V_{th}$ will influence this free-voltage space and hence affect the performance of electronic systems. Similarly, MOSFET device characteristics like threshold voltage, carrier mobility, and saturation velocity are heavily affected by temperature variations [Kum06], which in turn changes the performance of the associated electronic systems. For example, in automotive applications, electronic systems attached to automobile engines usually operate over a wide temperature range, from $-40^{\circ}C$ to $150^{\circ}C$ [Joh04], which can directly affect their performance.

Figure 5.1: Power supply $(V_{DD})$ and threshold voltage $(V_{th})$ vs technology node [Bas09].
## 5.2.1 SUPPLY-VOLTAGE AND TEMPERATURE VARIATIONS IN DIGITAL SYSTEMS

In digital systems, the propagation delay is a function of the drain saturation current provided by the active transistors, which in turn depends on the supply voltage $(V_{DD})$ and operational temperature [Kum06]. Therefore, any variation in the supply voltage or operational temperature will affect the propagation delay. In other words, the digital system performance (e.g. speed) is highly affected by both operational temperature and supply voltage. The important thing to note here is that this variation in delay does not include the long-term aging effects (NBTI etc.). The duration for which these temperature and voltage variations are present determines whether these aging effects also come into play.

For example, Figures 5.2(a), 5.2(b) and 5.2(c) show the percentage change in the delay of a five-stage ring oscillator over two years due to aging (NBTI), an operational-temperature change from 25°C to 100°C, and a supply-voltage change from 0.5V to 1.0V respectively. Figure 5.2(a) shows that the aging effects (NBTI etc.) over the same period of time depend on the operational temperature and the supply voltage. The higher the operational temperature or supply voltage is, the larger the corresponding aging effects are, and vice versa. Similarly, Figure 5.2(b) indicates that at a constant supply voltage, the percentage change in the delay of a five-stage ring oscillator due only to the operational-temperature change from $25^{\circ}C$ to $100^{\circ}C$ also changes with stress time. This means the change in delay as a result of a temperature change is also time dependent. Furthermore, Figure 5.2(c) shows that if the temperature is kept constant during the stress time, the percentage change in the delay of a five-stage ring oscillator due only to the supply-voltage change from 0.5V to 1.0V is still time dependent. This means the change in delay as a result of a supply-voltage change is also time dependent. These figures show that all three aspects, namely NBTI, operational temperature and supply voltage, have their own independent role in degrading the delay of a five-stage ring oscillator. Therefore, by controlling these changes individually, the corresponding degradation in the system performance can be controlled.

Figure 5.2: Percentage change in the delay of a five-stage ring oscillator versus stress time due to (a) the NBTI effect at two different temperatures and two different supply voltages, (b) a change in operating temperature from 25°C to 100°C at two different supply voltages, and (c) a change in supply voltage from 0.5V to 1.0V at two different temperatures (extracted from [Kri10]).

## 5.2.2 SUPPLY-VOLTAGE AND TEMPERATURE VARIATIONS IN ANALOG SYSTEMS

Similar to digital systems, analog circuit performance is also sensitive to supply voltage (V) and operating temperature (T). For example, Figures 5.3(a), 5.3(b), and 5.3(c) show the change in open-loop gain of an amplifier designed in 65nm technology over twenty years due to aging (only NBTI considered), an operational-temperature change from $25^{\circ}C$ to $125^{\circ}C$, and a supply-voltage change from 1.1V to 1.3V respectively.

Figure 5.3: Change in open-loop gain of an opamp, stressed at different temperatures, versus stress time due to (a) the NBTI effect, (b) a change in operating temperature from $25^{\circ}C$ to $125^{\circ}C$, and (c) a change in supply voltage from 1.1V to 1.3V.

Figure 5.3(a) shows that if the supply voltage is kept constant $(V_{DD} = 1.2V)$, then for each constant operational temperature the change in the open-loop gain of the amplifier as a result of the NBTI effect is a function of the stress time. Similarly, Figure 5.3(b) indicates that if the supply voltage is kept constant $(V_{DD} = 1.2V)$, then for each constant stress temperature (25°C, 50°C, 75°C, 100°C, and 125°C) the change in the open-loop gain caused by changing the current operational temperature from 25°C to 125°C also varies over the stress time. This means the change in the open-loop gain of the amplifier at any time, excluding the change due to the NBTI effect, is also a function of the operational temperature. Furthermore, Figure 5.3(c) reveals that at each constant operational temperature (e.g. 25°C, 50°C, 75°C, 100°C, and 125°C) the change in the open-loop gain caused by changing the supply voltage from 1.1V to 1.3V also varies over the stress time. This means the change in the open-loop gain of the amplifier at any time, independent of the NBTI effect, is also a function of the supply voltage. The above discussion indicates that all three aspects, namely NBTI, operational temperature and supply voltage, have their independent role in degrading the open-loop gain of an amplifier. Therefore, by individually controlling these aspects, the corresponding degradation in the open-loop gain of the amplifier can be controlled.

Figure 5.4: Change in an arbitrary performance parameter 'P' as a function of time due to VT and NBTI effects. Dotted horizontal lines show the performance margins.
#### 5.2.3 THE ROLE OF SUPPLY-VOLTAGE AND TEMPERATURE VARIATION

Comparing the effects, it is clear that at a particular time the change in the delay of a five-stage ring oscillator (Figures 5.2(b) and 5.2(c)) and in the open-loop gain of an amplifier (Figures 5.3(b) and 5.3(c)) due to a change in operational temperature and supply voltage is larger than the change due to the NBTI (aging) effect (Figures 5.2(a) and 5.3(a)). This means that operational-temperature and supply-voltage variations can play a significant role in the performance of digital and analog systems. These changes in performance parameters, delay for the five-stage ring oscillator and open-loop gain for the amplifier, are independent of the cumulative changes in performance parameters caused by aging (NBTI) effects.

To illustrate this, let an arbitrary performance parameter 'P' of a system undergo an arbitrary stress for a stress time 't' as shown in Figure 5.4. There are two types of change in the performance parameter 'P': the long-term change due to the NBTI effect and the short-term change due to VT variation effects. If one assumes that the dotted horizontal lines show the performance margins/boundaries for 'P', then 'P' can exceed the maximum performance margin/boundary earlier due to VT variations (Figure 5.4 at $t_1$) than due to the NBTI effect alone (Figure 5.4 at $t_2$). This also means that if the VT variations can be brought back to normal values by any means, the system performance parameter can remain within the performance margins. In other words, the total lifetime of the system can be extended by controlling the VT variations to operational values.
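The argument around Figure 5.4 can be made concrete with a small numerical sketch. It assumes, purely for illustration, that the NBTI-induced drift of 'P' follows a power law $a \cdot t^{n}$ (a commonly used simplified form) and that a VT excursion adds a constant offset on top of it. The function name `time_to_margin` and all parameter values are hypothetical and are not taken from the simulations in this thesis.

```python
def time_to_margin(margin, vt_offset, a, n):
    """Stress time at which the drift of 'P' reaches the performance margin.

    Assumed model (illustrative only): delta_P(t) = a * t**n + vt_offset,
    where a * t**n is the cumulative NBTI drift and vt_offset is the
    short-term shift caused by an out-of-spec temperature or supply voltage.
    """
    headroom = margin - vt_offset  # margin left for aging once VT has shifted 'P'
    if headroom <= 0:
        return 0.0  # the VT excursion alone already violates the margin
    return (headroom / a) ** (1.0 / n)


# Hypothetical numbers: margin of 0.2 (arbitrary units), a = 0.02, n = 0.2.
t2 = time_to_margin(margin=0.2, vt_offset=0.0, a=0.02, n=0.2)   # NBTI only
t1 = time_to_margin(margin=0.2, vt_offset=0.05, a=0.02, n=0.2)  # NBTI plus VT excursion
print(f"t1 (with VT excursion) = {t1:.0f} h, t2 (NBTI only) = {t2:.0f} h")
```

Under these assumed values the margin is reached markedly earlier when the VT excursion is present; setting `vt_offset` back to zero, i.e. returning V and T to nominal, recovers the full interval up to $t_2$.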
## 5.3 THE IMPORTANCE OF SEPARATING NBTI AND SUPPLY-VOLTAGE AND TEMPERATURE VARIATIONS

As discussed in Chapter 4, the reliability of an electronic system is mainly addressed in the simulation phase of the design stage. Furthermore, a number of device-level models are used to investigate the reliability issues in electronic systems [Mar11]; these are essential to safely guardband the system performance for a certain lifetime.

Most importantly, reliability estimations during the operational life of a system are crucial for dependable design. Reliability of the system requires that the performance parameters remain within the specified limits of correct operation. Therefore, by monitoring the performance parameter(s) most sensitive to short-term and long-term effects, one can estimate the reliability of the system as discussed in Chapters 3 and 4. At any time, if the performance parameter is within the specified limits of correct operation, the system is functioning correctly and is considered reliable at that time. If, on the other hand, the performance parameter is not within the specified limits of correct operation, the system is functioning incorrectly and is considered unreliable at that time. However, as discussed in section 5.2, this performance parameter is sensitive to operational temperature (T), supply voltage (V) and aging effects like NBTI, which cause short-term and long-term effects. Therefore, it is essential to separate these two effects and take the necessary actions in order to enhance the reliability of the system. The separation of long-term NBTI and short-term VT variations further requires regular monitoring of the operational temperature, the supply voltage and the most sensitive performance parameter.
If there are variations in the performance parameter, then the corresponding values of the operating temperature and supply voltage can clarify whether these changes are due to a change in the operational temperature and supply voltage or due to aging effects (NBTI etc.). These distinctions are then used in selecting the appropriate repair strategies in order to enhance the reliability, maintainability, availability and hence the dependability of the system. These are further discussed in the next section.

#### 5.4 ENHANCING THE SYSTEM DEPENDABILITY

One has to understand how to separate NBTI and VT variations and how to individually control the VT variations to normal (typical) values in order to enhance the system dependability. Let us consider the density function of an arbitrary performance parameter 'P' of 'n' identical systems (e.g. open-loop gain 'A' of 'n' identical opamps) as shown in Figure 5.5. This density function can be divided into three regions. The "Allowed" region of the performance parameter 'P' (e.g. $\pm 3\sigma$ of 'A') comprises the designed specification boundary region within which the system will function correctly as desired. The "Not Allowed" region is the region where 'P' goes beyond the designed specification boundaries and the system starts malfunctioning. Although the system is still working under this condition, the results will be unsatisfactory or undesirable. Finally, the permanent-failure region is the region where 'P' is not only beyond the designed specifications but the system also stops working, resulting in a permanent failure (permanent in the sense that it cannot be repaired except by replacing it with a fresh one).
Figure 5.5: The density function (DF) of an arbitrary performance parameter 'P' of 'n' identical electronic systems.

The allowed region represents normal operation, but in order to enhance the dependability or to prolong the lifetime of the system, the second and third regions should be avoided by counteracting with proper actions. This can be accomplished by:

- 1) *bringing back* the operating temperature and supply voltage to normal operating values for short-term (non-permanent) effects.
- 2) *digitally tuning* the performance parameters back to normal specifications for long-term effects.
- 3) incorporating some *fault-tolerant strategies*, like redundant systems/subsystems, in the case of permanent failures.

As an example for options 1 and 2: cooling fans can be used for adjusting the operational temperature, digitally-tuneable power supplies can be used for adjusting the supply voltages, and digitally-assisted electronic systems can be used for reconfiguring or digitally adjusting the performance parameters [Cru09]. Of course, bringing the operating temperature of complex systems-on-chip and the supply voltage of battery-operated devices back to normal operating values will be quite a challenge. These options are basically used to restore performance parameters from the second (malfunctioning) region to the normal (allowed) region (Figure 5.5). Similarly, a unit in the third region of permanent failure can be restored to the normal region of performance parameters by completely replacing the permanently faulty unit with a spare unit. These options are very similar to the options used in Chapter 3. However, the first option of bringing the operating temperature and supply voltage back to normal operating values is the new feature in this chapter. In fact, enhancing system dependability means enhancing its individual attributes [Avi01].
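The choice between the three options can be sketched as a small decision routine. This is a minimal illustration under stated assumptions, not the decision-making circuitry of this thesis: the names `Region`, `Action`, `classify` and `select_action`, and the flags `is_working`, `vt_is_nominal` and `tuning_steps_left`, are all hypothetical.

```python
from enum import Enum


class Region(Enum):
    ALLOWED = "allowed"
    NOT_ALLOWED = "not allowed"
    PERMANENT_FAILURE = "permanent failure"


class Action(Enum):
    NONE = "no action"
    ADJUST_VT = "option 1: bring V and T back to normal values"
    DIGITAL_TUNE = "option 2: digitally tune the performance parameter"
    REPLACE = "option 3: switch to a redundant unit"


def classify(p, spec_low, spec_high, is_working=True):
    """Map a measured value of 'P' onto the three regions of Figure 5.5."""
    if not is_working:
        return Region.PERMANENT_FAILURE
    if spec_low <= p <= spec_high:
        return Region.ALLOWED
    return Region.NOT_ALLOWED


def select_action(region, vt_is_nominal, tuning_steps_left):
    """Pick the least invasive option that can still restore 'P'."""
    if region is Region.ALLOWED:
        return Action.NONE
    if region is Region.PERMANENT_FAILURE:
        return Action.REPLACE
    if not vt_is_nominal:
        return Action.ADJUST_VT      # short-term cause: fix V and T first
    if tuning_steps_left > 0:
        return Action.DIGITAL_TUNE   # V and T are nominal, so the shift is long-term
    return Action.REPLACE            # no repair options left


# Hypothetical example: 'P' measured at 21.4 with an allowed range of [19, 21].
region = classify(21.4, 19.0, 21.0)
print(select_action(region, vt_is_nominal=False, tuning_steps_left=3).value)
```

Checking the VT conditions before spending a tuning step reflects the ordering described above: a deviation that disappears once V and T are back to normal should not consume one of the limited long-term repair options.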
In this chapter, and in the thesis as a whole, the primary focus has been on three of these attributes, namely reliability, maintainability and availability. The *reliability*, being the probability as a function of time that the system will be functioning correctly at that time, can be enhanced by using options 1, 2, and 3. The *maintainability*, being the probability as a function of time that the system will be repaired at that time if it fails to function correctly, can be enhanced by making the right choice between options 1, 2, and 3 (explained in the next sections). Similarly, the *availability*, being the probability as a function of time that the system will be available for its service, can be enhanced by reducing the repair time. One approach is to estimate the reliability of the system in advance and make the proper repair decisions at the right time, before the system fails or moves to the "Not Allowed" region of specifications. Anticipating repair actions in advance will reduce the repair time and hence increase the availability of the system, as discussed in section 4.7.2.

#### 5.5 DEPENDABLE HARDWARE ARCHITECTURE

Figure 5.6: Proposed new hardware architecture for enhancing the dependability of analog and mixed-signal systems-on-chip.

A hardware architecture similar to the one described in Chapter 4 is proposed here to enhance the dependability of a system-on-chip, as shown in Figure 5.6. It also consists of redundant sub-IPs $(SIP_{1(A,B,C)} \dots SIP_{4(A,B,C)})$ and a number of monitoring, tuning and decision-making circuits. The difference lies in the monitoring part. In the hardware architecture shown in Figure 5.6 there are two more monitoring circuits as compared to the previous one.
The on-chip temperature and supply-voltage monitoring circuits, which were not present in the previously proposed hardware architecture of Chapter 4, can be used to monitor the overall temperature and supply voltage of the whole chip, or the temperature and supply voltage of each individual sub-IP, as shown by the vertical dotted lines in Figure 5.6. As in the previously proposed hardware architecture, performance-parameter monitoring circuits are also present to monitor the performance parameter(s) most sensitive to short-term and long-term effects. The degradation of the most sensitive performance parameter(s), acquired either by direct or by indirect means as discussed in Chapter 4, can be used to estimate the reliability of the whole system or of each sub-IP. These three monitoring circuits communicate with the "Decision Making and Tuning Circuitry", which then takes the necessary repair actions as shown by the horizontal dotted lines in Figure 5.6.

Figure 5.7 shows the workflow of the proposed new dependable hardware architecture. This workflow is also similar to the workflow proposed in Chapter 4. The difference lies again in the operational-temperature and supply-voltage monitoring and control regions. The working principle of this workflow is explained below.

Figure 5.7: Workflow of the new proposed dependable hardware architecture (Figure 5.6) for achieving high system dependability.

#### 5.5.1 Principle of Workflow

In order to understand the working principle of the proposed workflow for the proposed hardware architecture, let us consider an exemplary system in which an arbitrary performance parameter 'P' is taken to be the most sensitive to short-term and long-term effects. This performance parameter 'P' is monitored for its variations due to short-term (VT) and long-term (NBTI aging) effects, as shown in Figure 5.8.
The dotted line shows the variations in 'P' due to VT effects and the solid line shows the variations due to the NBTI effect. During the regular testing mode (Figure 5.7), the performance parameter 'P' is first checked against its system specifications (1 in Figure 5.7), i.e. it is checked whether 'P' is within the specified limits/boundaries of the system specification. The system specifications stored in the database (11 in Figure 5.7) are used for this comparison. If 'P' is within the defined system specifications, then its current value, along with the corresponding operating-temperature and supply-voltage values, is logged and stored in the system database (2 and 3 in Figure 5.7). These values are subsequently used to estimate the reliability of the current state of the system, as described in Chapter 4 (4 in Figure 5.7). This can further be used to predict a possible hazard in advance, if any (8 in Figure 5.7). If, on the other hand, 'P' is outside the defined system specifications stored in the database, proper actions as discussed in section 5.4 are taken (5-10 in Figure 5.7).

Figure 5.8: An exemplary system where an arbitrary performance parameter 'P' is being monitored for its variations due to VT and NBTI effects. The points in time $t_1 - t_{10}$ show VT adjustment points, $T_1 - T_3$ show digital tuning points and $T_4$ shows the point of replacement with a fresh unit.

It is clear from Figure 5.8 that at points in time $t_1 \dots t_5$ the VT variations try to move the performance parameter 'P' beyond its defined performance margin. At each of these points in time ($t_1 \dots t_5$, Figure 5.8) the VT values are brought back (option '1' in section 5.4) to their normal values and the performance parameter 'P' returns to its allowed region of specification.
For example, this can be done by introducing cooling techniques, switching off unnecessary parts, or regulating or reconfiguring the power supply of each sub-IP if possible. Meanwhile, throughout these points in time ($t_1 \dots t_5$, Figure 5.8) the variations due to the NBTI effect continue to accumulate, and at time point $T_1$ these variations move the performance parameter 'P' beyond its performance margin. Therefore, at point in time $T_1$ the digital tuning capabilities (option '2' in section 5.4) of the system are used to bring 'P' back to its normal region of system specifications. Similarly, at points in time $t_6 \dots t_{10}$ and $T_2 \dots T_3$ (Figure 5.8) the performance parameter 'P' is brought back to its normal system specification using the above-mentioned techniques (options '1' and '2' in section 5.4). Most importantly, at point in time $T_4$ the repair capabilities of the system are no longer sufficient to move 'P' back to its normal specifications: no digital tuning or VT adjustment options remain available. Therefore, at point in time $T_4$ the system is replaced (option '3' in section 5.4) by a spare unit and continues to function correctly. Although the degradation in 'P' has been shown as unidirectional (i.e. increasing) in this example, it can be in either direction (increasing or decreasing), as discussed in section 5.6.3.

#### 5.5.2 PROS AND CONS OF PROPOSED APPROACH

The proposed approach has a number of benefits. Adjusting temperature and supply voltage on the one hand brings the system back to its normal operating conditions and on the other hand lowers the gradual effect of aging phenomena, thereby slowing down their negative effects.
Similarly, the digital tuning and replacement options will be used solely for countering long-term aging (NBTI etc.) and permanent-failure effects respectively. This will increase the reliability of the system as discussed in section 5.4. Estimating the reliability in advance makes it easier to anticipate the possible digital solutions (digital tuning and replacement options) and hence lowers the repair time. Here, the reliability can be estimated by directly or indirectly monitoring the degradation in the most critical performance parameter(s) as discussed in Chapter 4. Furthermore, lowering the repair time increases the maintainability of the system. This decrease in repair time also increases the availability of the system to perform its operation, as discussed in section 4.7.2. Hence, by increasing reliability, availability, and maintainability, the dependability of the system is increased.

The price paid is in terms of area overhead and complexity. The temperature, supply-voltage and performance monitoring circuits require extra area. Similarly, the database used to store the design-level system specifications and the runtime measurements requires memory elements and hence extra area. However, in spite of the area overhead, the improvements in terms of longer lifetime and enhanced system dependability could still make it an acceptable solution, especially for critical systems.

#### 5.6 SIMULATIONS AND RESULTS

The validity and feasibility of the proposed workflow and hardware architecture have been investigated by modelling a target system in the LabVIEW environment, as explained below.

#### 5.6.1 TARGET SYSTEM

In order to simplify matters, the proposed idea is verified based on the simulation results of a single sub-IP $(SIP_{1A})$ of the proposed hardware architecture (Figure 5.6), which is compared against a similar single sub-IP system simulated without the proposed idea.
This means that one sub-IP has the proper hardware architecture as explained in Figure 5.6 whereas the other sub-IP has no capabilities of repairing the performance. Furthermore, for the sake of simplicity, the 'gain' parameter of the sub-IP $SIP_{1A}$ has been assumed to be the performance parameter most sensitive to VT and NBTI variations. Table 5.1 shows the design-level specifications of $SIP_{1A}$ and the possible VT variations that could be expected in the working environment of the system.

Table 5.1: Designed specifications of $SIP_{1A}$ and the possible VT variations in the system working environment.

| Name | Symbol | Value |
|------|--------|-------|
| Designed gain of $SIP_{1A}$ | $G_{P-SIP1A}$ | 20dB |
| Designed (allowed) supply voltage variations of $SIP_{1A}$ | $V_{D-SIP1A}$ | [1.175V - 1.225V] |
| Designed (allowed) temperature variations of $SIP_{1A}$ | $T_{D-SIP1A}$ | [-25°C - 75°C] |
| Designed (allowed) gain variations of $SIP_{1A}$ | $G_{D-SIP1A}$ | [19dB - 21dB] |
| Possible working supply voltage variations | $V_{P-SIP1A}$ | [1.1V - 1.3V] |
| Possible working temperature variations | $T_{P-SIP1A}$ | [0°C - 125°C] |

#### 5.6.2 THE SIMULATION ENVIRONMENT

Figure 5.9 shows the possible 'gain' ($G_{P-SIP1A}$) parameter variations in the sub-IP $SIP_{1A}$ due to the expected VT variations in the working environment of the system. Figure 5.9 further shows that the possible gain parameter $G_{P-SIP1A}$ variations can move beyond the designed specifications $(G_{D-SIP1A})$ as a result of expected VT variations in the working environment of the system.

Figure 5.9: Possible gain variations of $SIP_{1A}$ due to VT variations in the system working environment.
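How VT variations within the possible working ranges of Table 5.1 can push the gain outside its designed range can be illustrated with a short sketch. The ranges and the nominal gain come from Table 5.1; everything else is an assumption made for illustration only: the linear gain model, the sensitivities `K_V` and `K_T`, the nominal point of 1.2V and 25°C used as the model centre, the uniform sampling and the function names. It is not the LabVIEW model used in this chapter.

```python
import random

G_NOM = 20.0                      # designed gain in dB (Table 5.1)
G_SPEC = (19.0, 21.0)             # designed (allowed) gain range in dB (Table 5.1)
V_DESIGN, T_DESIGN = (1.175, 1.225), (-25.0, 75.0)   # designed ranges (Table 5.1)
V_POSSIBLE, T_POSSIBLE = (1.1, 1.3), (0.0, 125.0)    # possible working ranges (Table 5.1)

V_NOM, T_NOM = 1.2, 25.0          # assumed centre of the linear model
K_V = 8.0                         # assumed gain sensitivity in dB per volt
K_T = -0.012                      # assumed gain sensitivity in dB per degree Celsius


def gain_db(v, t):
    """Hypothetical linear model of the gain as a function of V and T."""
    return G_NOM + K_V * (v - V_NOM) + K_T * (t - T_NOM)


def in_spec(g):
    return G_SPEC[0] <= g <= G_SPEC[1]


def out_of_spec_fraction(v_range, t_range, samples=10_000, seed_v=1, seed_t=2):
    """Fraction of random (V, T) operating points whose gain violates the spec.

    Two separately seeded generators are used so that the supply-voltage and
    temperature sequences are generated independently of each other.
    """
    rng_v, rng_t = random.Random(seed_v), random.Random(seed_t)
    misses = 0
    for _ in range(samples):
        v = rng_v.uniform(*v_range)
        t = rng_t.uniform(*t_range)
        if not in_spec(gain_db(v, t)):
            misses += 1
    return misses / samples


print("designed ranges :", out_of_spec_fraction(V_DESIGN, T_DESIGN))
print("possible ranges :", out_of_spec_fraction(V_POSSIBLE, T_POSSIBLE))
```

With the assumed sensitivities the gain never leaves [19dB - 21dB] inside the designed VT ranges, while a non-zero share of the operating points in the wider possible ranges violates the specification. Those are the cases in which bringing V and T back to normal values would restore the gain.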
To make it more illustrative, let $SIP_{1A-C}$ (subscript 'C' as an indicator) represent the sub-IP which is compensated by means of the proposed hardware architecture and has three digital tuning knobs for the 'gain' parameter. This means $SIP_{1A-C}$ has eight possible digital tuning options for adjusting the 'gain' parameter. Similarly, let $SIP_{1A-NC}$ represent the sub-IP which is neither compensated by means of the proposed hardware architecture nor has any digital tuning capabilities. Furthermore, two uncorrelated random number generators have been used to generate the randomly changing supply-voltage and operational-temperature variations expected in the actual working environment of the system. The next section discusses the simulation results of the sub-IP $SIP_{1A}$ with and without the proposed hardware architecture and workflow, as shown in Figure 5.10.

#### 5.6.3 SIMULATION RESULTS OF THE TARGET SYSTEM

Figure 5.10: Simulation results over 43523 hours ($\sim$5 years) of a single sub-IP ($SIP_1$) of the system in Figure 5.6 with and without the proposed architecture (Figure 5.6) and workflow (Figure 5.7).

The four sub-graphs in Figure 5.10 represent the four monitors $(M_1 \dots M_4)$. Monitor $M_4$ shows the change in the 'gain' parameter due to the aging (NBTI) effects only. It also shows the results of the decision-making and tuning circuitry. The other three monitors are used as the temperature monitor $(M_1)$, the supply-voltage monitor $(M_2)$, and the performance (here the 'gain' parameter) monitor $(M_3)$ respectively. The vertical red lines in monitor $M_4$ show the points in time when the operational-temperature and supply-voltage values of $SIP_{1A-C}$ are adjusted back (option 1 in section 5.4) to normal values (i.e. 25°C and 1.2V respectively) because the 'gain' parameter has moved beyond the designed specifications ([19dB - 21dB]).
These adjustments in the operational temperature and supply voltage, and the resulting value of the 'gain' parameter of $SIP_{1A-C}$, are shown by the white lines in monitors $M_1$, $M_2$, and $M_3$ respectively. The parameters $T_1 \dots T_8$ show the points in time when the 'gain' parameter of $SIP_{1A-C}$ can no longer be restored by adjusting the operational-temperature and supply-voltage values back to their normal values. At these points in time ($T_1 \dots T_8$) the aging (NBTI) effect is responsible for moving the 'gain' parameter of $SIP_{1A-C}$ beyond its designed specifications ([19dB - 21dB]). Therefore, at these points in time the digital tuning capabilities (option 2 in section 5.4) of $SIP_{1A-C}$ are used to move the 'gain' parameter back to its normal designed specification (i.e. within [19dB - 21dB]). Similarly, the unadjusted variations in the operational-temperature and supply-voltage values and the corresponding 'gain' parameter values of $SIP_{1A-NC}$ are shown by the red lines in monitors $M_1$, $M_2$, and $M_3$ respectively. It is clear from these monitor values that, because no adjustments are made to the operational-temperature and supply-voltage values, the 'gain' parameter of $SIP_{1A-NC}$ moves out of its specification boundaries at several points in time ($t_1 \dots t_{15}$), either for a short or for a long duration. The length of time for which the 'gain' parameter stays outside the specification boundaries depends on the corresponding duration of the VT and aging (NBTI) variations.

#### 5.6.4 COMPARISON OF THE SIMULATION RESULTS

The monitor $M_3$ in Figure 5.10 shows the comparison between the 'gain' parameters of $SIP_{1A-C}$ (white line) and $SIP_{1A-NC}$ (red line).
This comparison shows that, owing to the proposed hardware architecture and workflow, the sub-IP $SIP_{1A-C}$ can be used for a longer time than the sub-IP $SIP_{1A-NC}$, which has no means of performance compensation. The sub-IP $SIP_{1A-C}$ remains functional until the point in time $R_1$. This is the point in time when $SIP_{1A-C}$ needs to be completely replaced (option 3 in section 5.4) with a redundant sub-IP. This replacement happens at $R_1$ (Figure 5.10) because no digital tuning options remain for tuning the 'gain' parameter of $SIP_{1A-C}$ back to its normal specifications. The sub-IP $SIP_{1A-NC}$, on the other hand, needs a complete replacement with a redundant sub-IP at each point in time $T_1 \dots T_8$ because it has no repair mechanisms. Worse still, as can be seen from the red line of monitor $M_3$ in Figure 5.10, the sub-IP $SIP_{1A-NC}$ starts functioning outside the specification limits/boundaries at several points in time ($t_1 \dots t_{15}$). At all of these points in time the sub-IP becomes unreliable, which is highly undesirable. Furthermore, the presented idea can be extended to and simulated for other sub-IPs as well. The simulation results and comparison shown in Figure 5.10 give confidence that the dependability of the whole system can be enhanced by using the proposed hardware architecture and workflow. It should be noted that these simulation results are based on high-level abstract models that do not include the detailed implementation complexities, fabrication-related process variations or the corresponding area overheads.
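In the same high-level spirit, the difference between a compensated and an uncompensated sub-IP can be mimicked with a toy calculation. It assumes that V and T are kept at their normal values, that the NBTI-induced gain drift follows a hypothetical power law $a \cdot t^{n}$, and that each of the eight tuning options shifts the gain back by a fixed step. The function names and all numerical values are illustrative assumptions; they do not reproduce the time points of Figure 5.10.

```python
def hours_to_drift(drift_db, a, n):
    """Invert the assumed power-law drift a * t**n to obtain the stress time in hours."""
    return (drift_db / a) ** (1.0 / n)


def repair_schedule(margin_db, tuning_steps, step_db, a, n):
    """Return (tuning_times, replacement_time) for a unit whose VT is kept nominal.

    Every tuning step is assumed to shift the gain back by step_db, so the
    k-th tuning (k = 0, 1, ...) is needed once the accumulated drift reaches
    margin_db + k * step_db. When all steps are used up, the next violation
    of the margin forces a replacement.
    """
    tuning_times = [hours_to_drift(margin_db + k * step_db, a, n)
                    for k in range(tuning_steps)]
    replacement_time = hours_to_drift(margin_db + tuning_steps * step_db, a, n)
    return tuning_times, replacement_time


# Hypothetical values: 1 dB margin (20 dB nominal, 19 dB lower limit),
# eight tuning options of 0.1 dB each, drift parameters a = 0.1 and n = 0.25.
tunings, replace_compensated = repair_schedule(1.0, 8, 0.1, a=0.1, n=0.25)
replace_uncompensated = hours_to_drift(1.0, a=0.1, n=0.25)  # no tuning available

print(f"uncompensated unit leaves the spec after {replace_uncompensated:.0f} h")
print(f"compensated unit is replaced after {replace_compensated:.0f} h "
      f"({len(tunings)} tuning actions before that)")
```

In this sketch the uncompensated unit has to be replaced at the instant the compensated unit performs its first tuning action, which mirrors the qualitative comparison above; the actual gain in lifetime depends entirely on the assumed drift law and tuning range.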
Similarly, due to a lack of system-level dependability enhancement techniques, especially for analog and mixed-signal systems, the current technique cannot, to the best of our knowledge, be compared to any other technique. In addition, the exact implementation complexities and the corresponding overheads could not be calculated due to time limitations. However, the simulation results are promising and encouraging.

#### 5.7 CONCLUSIONS

In this chapter, the importance of differentiating between short-term and long-term dependability issues has been discussed. It has been shown by simulations that by separating the short-term and long-term dependability issues and addressing them separately, the dependability of electronic systems can be further enhanced during the operational life of a system. Reliability, being one of the important attributes of dependability, is influenced in the short term (non-permanently) by operating temperature and supply voltage, and in the long term (permanently) by a number of aging phenomena. Therefore, by separating them as short-term and long-term effects, proper actions can be taken in order to enhance the dependability of the system. This has been achieved in a proposed hardware architecture and workflow which regularly monitors and stores the operating-temperature and supply-voltage values, along with the value of the most sensitive performance parameter(s) of the system, in a logged database. It then estimates the reliability of the system based on these stored values. These estimations are subsequently used to anticipate the system performance in advance, which is further used to take the necessary actions: either adjusting the operational-temperature and supply-voltage variations back to their nominal values, or digitally tuning and subsequently replacing the faulty sub-IPs with redundant sub-IPs. These actions will further enhance the overall dependability of the system.
Furthermore, the simulation results for a dummy target system, modelled in the LabVIEW environment, under randomly changing supply-voltage and working-temperature variations show that the proposed technique for enhancing the system dependability during the operational life is a valid technique.

#### 5.8 REFERENCES

[Ali06] M. Alioto and G. Palumbo, "Impact of Supply Voltage Variations on Full Adder Delay: Analysis and Comparison," in IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Vol. 14, No. 12, pp. 1322-1335, 2006.

[Avi01] A. Avizienis, J-C. Laprie, and B. Randell, "Fundamental concepts of dependability," in Laboratory for Analysis and Architecture of Systems (LAAS-CNRS) Technical Report no. 01-145, Apr. 2001.

[Bas09] A. Baschirotto, et al., "Low Power Analog Design in Scaled Technologies," in Topical Workshop on Electronics for Particle Physics (TWEPP), pp. 103-110, 2009.

[Cru09] S.C. de la Cruz, et al., "Design and implementation of operational amplifiers with programmable characteristics in a 90nm CMOS process," in IEEE Eur. Conf. Circuit Theory and Design (ECCTD), pp. 209-212, 2009.

[Jha05] N.K. Jha, P.S. Reddy, D.K. Sharma, and V.R. Rao, "NBTI degradation and its impact for analog circuit reliability," in IEEE Trans. Electron Devices, Vol. 52, No. 12, pp. 2609-2615, 2005.

[Joh04] R.W. Johnson, et al., "The Changing Automotive Environment: High Temperature Electronics," in IEEE Trans. on Electronics Packaging Manufacturing, Vol. 27, No. 3, pp. 164-176, 2004.

[Kha13] M.A. Khan and H.G. Kerkhoff, "Monitoring Operating Temperature and Supply Voltage in Achieving High System Dependability," in IEEE Int. Conf. Design & Technology of Integrated Systems (DTIS), pp. 112-116, 2013.

[Kri10] S.K. Krishnappa, H. Singh, and H. Mahmoodi, "Incorporating Effects of Process, Voltage, and Temperature Variation in BTI Model for Circuit Design," in IEEE Latin American Symposium on Circuits and Systems (LASCAS), pp. 236-239, 2010.

[Kum06] R. Kumar and V. Kursun, "Impact of temperature fluctuations on circuit characteristics in 180nm and 65nm CMOS technologies," in IEEE Int. Symp. on Circuits and Systems (ISCAS), pp. 3858-3861, 2006.

[Lat11] M.A.A. Latif, N.B.Z. Ali, and F.A. Hussin, "A case study of process-variation effect to SoC analog circuits," in IEEE Int. Conf. Recent Advances in Intelligent Computational Systems (RAICS), pp. 520-523, 2011.

[Lew09] L.L. Lewyn, T. Ytterdal, C. Wulff, and K. Martin, "Analog Circuit Design in Nanoscale CMOS Technologies," in Proceedings of the IEEE, Vol. 97, No. 10, pp. 1687-1714, 2009.

[Lu09] Y. Lu, et al., "Statistical reliability analysis under process variation and aging effects," in IEEE Int. Design Automation Conference (DAC), pp. 514-519, 2009.

[Mar11] E. Maricau, et al., "A compact NBTI model for accurate analog integrated circuit reliability simulation," in Proceedings of the European Solid-State Device Research Conference (ESSDERC), pp. 147-150, 2011.

[Pau05] B.C. Paul, K. Kang, H. Kufluoglu, M.A. Alam, and K. Roy, "Impact of NBTI on the temporal performance degradation of digital circuits," in IEEE Electron Device Letters, Vol. 26, No. 8, pp. 560-562, 2005.

[Sch03] D.K. Schroder and J.A. Babcock, "Negative bias temperature instability: Road to cross in deep submicron silicon semiconductor manufacturing," in Journal of Applied Physics, Vol. 94, No. 1, pp. 1-18, 2003.

[Uns06] O.S. Unsal, et al., "Impact of Parameter Variations on Circuits and Microarchitecture," in IEEE Micro, Vol. 26, No. 6, pp. 30-39, 2006.

[Wan08] W. Wang, et al., "Statistical prediction of circuit aging under process variations," in IEEE Int. Custom Integrated Circuits Conference (CICC), pp. 13-16, 2008.
# PERFORMANCE DEGRADATION ANALYSIS AND DEPENDABILITY ENHANCEMENT OF SAR ADCS

ABSTRACT: It has been shown in the previous chapters that in order to improve the runtime dependability (i.e. the dependability during operational life), one first has to investigate the critical performance parameters with respect to aging effects. The degradation information of the most critical performance parameter(s), monitored directly or indirectly, can then be used to take the necessary actions for improving the system dependability during its operational life. This chapter deals with these issues in detail by analysing the degradation effects in the performance parameters of a sample mixed-signal system, a charge-redistribution successive approximation register (SAR) ADC. This device will be further analysed to propose possible dependability improvement strategies. The first challenge is that conducting transistor-level aging simulations in complex analog and mixed-signal systems like analog-to-digital converters is often very time consuming. Therefore, in this chapter we will first investigate the degradation effects in the performance parameters of a charge-redistribution SAR ADC by using a system-level approach. In this approach the whole system has been divided into its sub-building blocks and the degradation effects of each individual sub-building block have been incorporated into its system-level models. Furthermore, these models have been simulated in the LabVIEW environment in order to investigate aging effects in the static and dynamic performance parameters of a charge-redistribution SAR ADC. The results of these simulations have been further used to find the most critical performance parameter(s) for monitoring the SAR ADC performance degradation during operational life. These selected parameters are discussed in detail to find strategies to mitigate the performance degradation in the SAR ADC.
This will result in dependability enhancement strategies for the SAR ADC during operational life.

### 6.1 Introduction

In chapters 3, 4, and 5 it has been shown that monitoring the degradation effects in potential critical performance parameter(s), either by direct or indirect means, is crucial for dependable design. In these chapters it has also been indicated that system-level performance parameters can be used to estimate the reliability during the operational life of a system. Furthermore, the selection of potential critical system-level performance parameter(s) is application dependent and can be divided into different categories based on their sensitivity to aging effects and process variations. The most sensitive system-level performance parameter can be identified via aging simulations at design time and can be regarded as the best indicator for degradation effects. Therefore, electronic designs developed in aging-critical technology nodes require aging simulations in advance in order to obtain a more dependable design.

In literature, simulation techniques are available to investigate aging effects based on device-level (transistor) models [Mar11, Kri10, Zha08]. Mostly, efforts have been made either for a single device (MOS transistor) or for relatively simple circuits like an inverter, a ring oscillator, or amplifiers [Wan11]. Transistor-level aging simulators, based on device-level (transistor) models [Mar11, Bao09], are frequently used to simulate aging effects in these smaller and simpler circuits. However, due to the long simulation times and the complexity involved, they are not suitable for complex analog and mixed-signal systems. These complex systems require a different strategy that is fast and sufficiently efficient while still being based on device-level aging models.
Therefore, behavioural models, which are frequently used in electronics to study the performance of systems at a high simulation speed, can also be used to simulate aging-related degradation effects in these electronic systems. One possible way to simulate aging effects in larger systems is to sub-divide the system into smaller sub-building blocks. The idea is to simulate aging effects for each individual sub-building block, or at least for the aging-critical parts, using transistor-level aging simulators and to use this degradation information in the system-level models. In this way degradation effects in the performance parameters of the whole system can be investigated. Another possibility, which has been used in this chapter, is to analyse the performance of the whole system by using a plausible set of degraded values. This set can be acquired via transistor-level aging simulators and can be used in the behavioural models of each sub-building block.

In this chapter, the focus is to analyse the degradation effects in the performance parameters of a relatively simple, typical mixed-signal circuit, a charge-redistribution SAR ADC, based on the degradation models of its sub-building blocks. Successive approximation register (SAR) analog-to-digital converters (ADCs) are among the most widely used ADCs in the electronics industry [Hae10]. This is due to their relatively simple architecture and the simple technology requirements for integrating capacitors in standard CMOS processes. The emphasis will be on analysing the static and dynamic performance degradation as a result of the buffer, comparator and DAC capacitor-array degradation. For simplicity, the rest of the sub-building blocks will be treated as ideal components (i.e. no aging), which has no influence on the presented methodology.
The static and dynamic performance parameters that will be analysed as part of the performance analysis include offset, gain, differential nonlinearity error (DNLE), integral nonlinearity error (INLE), signal-to-noise and distortion (SINAD), total harmonic distortion (THD), and effective number of bits (ENOB). Based on the observed degradation effects in the static and dynamic performance parameters it will be investigated which performance parameters are most critical to aging effects and are the best indicators with respect to degradation effects. These performance parameters will subsequently be used to investigate a number of potential dependability enhancement strategies.

The rest of the chapter is organized as follows. Section 6.2 will describe the building blocks of a charge-redistribution SAR ADC, its operating principle, and the important mathematical formulations used to model the degradation effects in this ADC. The mathematical equations used to analyse the static and dynamic performance parameter degradations are explained in section 6.3. The architectural composition of the simulation setup modelled in a LabVIEW environment and the corresponding simulation results are discussed in sections 6.4 and 6.5 respectively. The potential critical performance parameters to be used as the best indicators for performance degradations are provided in section 6.6. The use of mitigation techniques for these degraded performance parameters in the proposed dependability enhancement strategies is discussed in section 6.7. The summary and some important references are presented in sections 6.8 and 6.9 respectively.

#### 6.2 THE CHARGE REDISTRIBUTION SAR ADC

Among available architectures for SAR ADCs, the charge-redistribution (CR) based successive approximation is the most widely used architecture due to its medium digital output resolution, relatively simple architecture and simple control scheme [Hae10, Pei10].
Figure 6.1 shows the simplified circuit diagram of this ADC. It consists of a switching mechanism for analog input and reference voltages, an input buffer circuit, a digital-to-analog converter (DAC), a comparator and SAR control logic. The input buffer drives the DAC with the analog input voltage ($V_{IN}$) and the reference voltage ($V_{REF}$) during different switching phases. Similarly, Figure 6.2 shows the binary-weighted capacitor array for an N-bit DAC architecture, which consists of N capacitors and one dummy capacitor of capacitance 'C' for an N-bit ADC.

#### 6.2.1 THE WORKING PRINCIPLE OF THE ADC

The SAR ADC converts the analog input signal to the digital output signal in three phases: the sampling phase, the hold phase and the redistribution phase, in which the actual conversion takes place [Kug00]. During the *sampling phase*, the switches $S_{SAMPLE}^{1}$, $S_{SAMPLE}^{2}$ and $S_{D}^{1}$, $S_{0}^{1}$, ..., $S_{N-1}^{1}$ are closed and all capacitors in the capacitor array sample the input voltage $V_{IN}$ (Figure 6.2). During the *hold phase*, the switches $S_{SAMPLE}^{1}$, $S_{SAMPLE}^{2}$, and $S_{D}^{1}$, $S_{0}^{1}$, ..., $S_{N-1}^{1}$ are opened and all the switches $S_{D}^{0}$, $S_{0}^{0}$, ..., $S_{N-1}^{0}$ are closed, thereby providing a voltage $V_{C} = -V_{IN}$ to the comparator input. This shows that the present ADC architecture has a built-in sample-and-hold mechanism.

The actual conversion is performed during the *redistribution phase*, which takes N conversion steps for an N-bit ADC. During this whole phase, the switch $S_{REF}$ remains closed (Figure 6.1). In the first conversion step, all the switches $S_{D}^{1}$, $S_{0}^{1}$, ..., $S_{N-2}^{1}$ and $S_{N-1}^{0}$ are opened and the switches $S_{D}^{0}$, $S_{0}^{0}$, ..., $S_{N-2}^{0}$ and $S_{N-1}^{1}$ are closed, thereby connecting $C_N$ to reference voltage $V_{REF}$.
This corresponds to the full-scale range (FSR) of the ADC (Figure 6.2). The left-most capacitor $C_N$ forms a 1:1 capacitor ratio with the remaining capacitors in the capacitor array, which are connected to ground via the switches $S_{D}^{0}$, $S_{0}^{0}$, ..., $S_{N-2}^{0}$ respectively. At this stage the comparator input voltage becomes $V_C = -V_{IN} + V_{REF}/2$. If $V_{IN} > V_{REF}/2$ (i.e. $V_C < 0$), the comparator output goes high and the switch $S_{N-1}^{1}$ remains closed, setting the most significant bit of the digital output code ($D_{OUT}$) to one (MSB = 1). On the other hand, if $V_{IN} < V_{REF}/2$ (i.e. $V_C > 0$), the comparator output goes low, the switch $S_{N-1}^{1}$ is opened and the switch $S_{N-1}^{0}$ is closed to discharge $C_N$, setting the most significant bit of the digital output code ($D_{OUT}$) to zero (MSB = 0). This process continues until the least significant bit (LSB) of the digital output code ($D_{OUT}$) is determined.

Figure 6.1: The simplified circuit diagram of a charge redistribution SAR ADC.

Figure 6.2: DAC capacitor array for an N-bit charge redistribution SAR ADC.

The ideal output voltage of the DAC capacitor array after N steps is given by [Hae10]:

$$V_{C,ideal} = -V_{IN} + \frac{C_H}{C_H + C_L} \cdot V_{REF} \tag{6.1}$$

where $C_H = \sum_K C_K$ for K such that $S_K^{1}$ is closed and $C_L = \sum_K C_K$ for K such that $S_K^{0}$ is closed. As the switch $S_D^{0}$ will always be closed during the redistribution phase, the right-most dummy capacitor $C_0$ will always be included in $C_L$. Each bit of the digital output $D_{OUT}$ is hence determined by the states of $S_K^{1}$ and $S_K^{0}$ respectively. The bit-value is '1' if $S_K^{1}$ is closed and the bit-value is '0' if $S_K^{0}$ is closed.
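
The redistribution phase described above amounts to a binary search over the DAC capacitor array. The following minimal Python sketch illustrates it under the assumption of ideal components and a single-ended input between 0 and $V_{REF}$. The names `sar_convert`, `caps` and `dummy` are illustrative only and are not taken from the LabVIEW model used in this chapter.

```python
def sar_convert(v_in, v_ref, caps, dummy=1.0):
    """Return the output code of an ideal charge-redistribution SAR ADC.

    caps[k] is the capacitance switched by bit k (LSB first), in units of C.
    dummy is the dummy capacitor C_0, which always stays on the ground side.
    """
    c_total = sum(caps) + dummy      # C_H + C_L is constant during conversion
    c_high = 0.0                     # capacitance currently tied to V_REF
    code = 0
    for k in reversed(range(len(caps))):          # MSB first
        trial = c_high + caps[k]                  # tentatively connect C_k to V_REF
        v_c = -v_in + trial / c_total * v_ref     # equation (6.1)
        if v_c < 0:                               # V_IN above DAC level: bit = 1
            c_high = trial
            code |= 1 << k
        # otherwise the capacitor is returned to ground and the bit stays 0
    return code


# Binary-weighted 12-bit array (1C ... 2048C) plus the dummy capacitor C_0 = 1C.
ideal_caps = [float(2 ** k) for k in range(12)]
print(sar_convert(1.0, 5.0, ideal_caps))
```

For $V_{IN}$ = 1.0 V and $V_{REF}$ = 5.0 V this sketch returns the code 819, i.e. the largest code whose DAC level stays below the input voltage.
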
#### 6.2.2 MODELLING DEGRADATION EFFECTS IN THE SAR ADC

In order to study the degradation effects for different performance parameters of the ADC, it has first been divided into different sub-building blocks including switches, buffer, capacitor array for DAC, comparator and SAR control logic. All these sub-building blocks have been modelled as ideal components in LabVIEW. Afterwards, the aging effects have been incorporated in these ideal models for the different sub-building blocks. Based on the architecture shown in Figure 6.1, the different voltages at nodes A, B, and C can be written as:

$$V_A = \begin{cases} V_{IN} & \text{if } S_{SAMPLE}^{1} = \text{closed and } S_{REF} = \text{open} \\ V_{REF} & \text{if } S_{REF} = \text{closed and } S_{SAMPLE}^{1} = \text{open} \end{cases} \tag{6.2}$$

$$V_B = V_A = \begin{cases} V_{IN} & \text{if } S_{SAMPLE}^{1} = \text{closed and } S_{REF} = \text{open} \\ V_{REF} & \text{if } S_{REF} = \text{closed and } S_{SAMPLE}^{1} = \text{open} \end{cases} \tag{6.3}$$

$$V_C = -V_B = -V_{IN} \quad \text{during hold phase, if } S_{SAMPLE}^{1}, S_{D}^{0}, S_{0}^{0}, \dots, S_{N-1}^{0} = \text{closed and } S_{SAMPLE}^{2}, S_{D}^{1}, S_{0}^{1}, \dots, S_{N-1}^{1} = \text{open} \tag{6.4}$$

As the hold phase is followed by the redistribution phase, the voltage $V_C$ after N conversion steps will be the sum of the voltages in the hold phase and the redistribution phase, as explained in the previous section. This means that after these three phases the voltage '$V_C$' will be given by equation (6.1).

The next step is to incorporate the aging effects in these equations. For this, first the degradation effects of each individual sub-building block have to be investigated. This can be done by using the actual transistor-level circuits in combination with their device-level aging models [Mar11, Bao09]. Alternatively, a range of appropriate degradation values can be used for each sub-building block. This is what has been used in this chapter, where results have been extracted based on transistor-level circuits and NBTI aging models used in [Wan11] for the 65nm TSMC technology node.
According to the architecture in Figure 6.1, different sub-building blocks of the ADC could be subject to different aging mechanisms. However, for simplicity, only NBTI has been considered as the dominant degradation mechanism. Among the different sub-building blocks of the ADC only the buffer, the comparator and the DAC capacitor array are considered more prone to aging effects. As experience has shown, the digital parts may introduce additional delays as a result of aging effects; however, these effects will be ignored. Therefore, the rest of the sub-building blocks, including switches and SAR control logic, have been modelled as ideal components.

#### 6.2.2.1 MODELLING THE BUFFER AND COMPARATOR DEGRADATION EFFECTS

As a result of aging, transistor parameters in the buffer and the comparator circuits are expected to change, which in turn could change the performance parameters of these circuits. Among the different degraded performance parameters, the offset and the gain are the ones most influenced by aging effects in both buffer and comparator circuits [Wan11]. On the other hand, the closed-loop configuration of the buffer and the voltage-comparing nature of the comparator will make both less sensitive to gain degradation effects. Therefore, only offset degradation in the buffer and the comparator has been considered to be crucial and will be incorporated in the above equations.
Hence, if $V_{OS,BUFF}$ and $V_{OS,COMP}$ represent the offset voltages of the buffer and the comparator respectively, then the above equations for $V_B$ and $V_C$ can be rewritten as:

$$V_B = V_A + V_{OS,BUFF} = \begin{cases} V_{IN} + V_{OS,BUFF} & \text{if } S_{SAMPLE}^{1} = \text{closed and } S_{REF} = \text{open} \\ V_{REF} + V_{OS,BUFF} & \text{if } S_{REF} = \text{closed and } S_{SAMPLE}^{1} = \text{open} \end{cases} \tag{6.5}$$

$$V_C = -V_B = -(V_{IN} + V_{OS,BUFF} + V_{OS,COMP}) \quad \text{during hold phase, if } S_{SAMPLE}^{1}, S_{D}^{0}, S_{0}^{0}, \dots, S_{N-1}^{0} = \text{closed and } S_{SAMPLE}^{2}, S_{D}^{1}, S_{0}^{1}, \dots, S_{N-1}^{1} = \text{open} \tag{6.6}$$

Similarly, after N steps during the redistribution phase, '$V_C$' can be rewritten as:

$$V_C = -(V_{IN} + V_{OS,BUFF} + V_{OS,COMP}) + \frac{C_H}{C_H + C_L} (V_{REF} + V_{OS,BUFF}) \tag{6.7}$$

This equation shows that the comparator offset voltage will be added only to the analog input voltage ($V_{IN}$), whereas the buffer offset voltage will be added to both the analog input voltage ($V_{IN}$) and the reference voltage ($V_{REF}$). In other words, the degradation in the buffer and the comparator offset voltages will change the voltage value at the comparator input and hence the digital output of the ADC could change. These degradations in the buffer and the comparator offset voltages have been incorporated in the buffer and comparator behavioural models. The next section will discuss how DAC capacitor-array degradation effects can be incorporated in equation (6.7).

#### 6.2.2.2 MODELLING THE DAC CAPACITOR-ARRAY DEGRADATION EFFECTS

After modelling the buffer and comparator degradation effects, the next step is to model the DAC capacitor-array degradation effects. Metal-Insulator-Metal (MIM) capacitors, being widely used in A/D and D/A converters, degrade as a function of input stress voltage, working stress temperature, and stress time.
This degradation could change its behaviour (increasing or decreasing) after a certain amount of time based on the stress conditions [Hot09, Sed11, Chi07, Wu10]. For example, in case of SiO<sub>2</sub> MIM-capacitors, the capacitance increases at a constant stress for a certain period of stress time and starts decreasing after that period of stress time with the same stress conditions. This reversal in MIM-capacitance degradation behaviour begins earlier at elevated stress temperatures [Chi07]. This means that the capacitance of MIM capacitors is a complex function of input stress voltage '$V_{STRESS}$', working stress temperature '$T_{STRESS}$' and the corresponding stress time 't'. This can be written as:

$$\Delta C_{MIM} = f(V_{STRESS}, T_{STRESS}, t) \tag{6.8}$$

Let us assume that the input stress voltage '$V_{IN}$' and the working stress temperature '$T_{STRESS}$' are randomly changing over a specific period of time (e.g. 20 years). In this situation the degradation in each capacitor '$C_N$' of the DAC capacitor array will depend on the switching activity of its associated switches $S_{N-1}^{1}$ and $S_{N-1}^{0}$ (Figure 6.2), the stress time 't', the input stress voltages '$V_{IN}$' and '$V_{REF}$' and the stress temperature '$T_{STRESS}$'. This means:

$$\Delta C_N = f\left(S_{N-1}^{1}\big|_{OFF}^{ON},\ S_{N-1}^{0}\big|_{OFF}^{ON},\ V_{IN},\ V_{REF},\ T_{STRESS},\ t\right) \tag{6.9}$$

Furthermore, the switching activity in the associated switches $S_{N-1}^{1}$ and $S_{N-1}^{0}$ for each capacitor '$C_N$' will further depend on the input voltage '$V_{IN}$', which can have any **random** value between zero and the full-scale (FS) voltage of the ADC (Figure 6.1). This means that for each capacitor '$C_N$' the input stress voltage will fluctuate randomly between '$V_{IN}$', '$V_{REF}$' and the ground terminal, with '$V_{IN}$' being another random value as described above.
During the random ON state the capacitor '$C_N$' is connected to the '$V_B$' terminal and hence $V_{STRESS} = V_{IN}$ or $V_{REF}$. During the random OFF state the capacitor '$C_N$' is connected to the ground terminal and hence $V_{STRESS} = 0V$. This makes the input stress voltage '$V_{STRESS}$' for capacitor '$C_N$' a random process over the stress time 't'. Similarly, if the stress temperature '$T_{STRESS}$' is also randomly changing over the stress time 't' (e.g. 20 years), then the change or degradation in the capacitance of capacitor '$C_N$' will also have a random behaviour. This means that at some random points in time the capacitance of capacitor '$C_N$' will increase, whereas at other random points in time it will decrease [Chi07].

As a conclusion, the degradation in each capacitor '$C_N$' as a result of aging can be randomly increasing or decreasing, and the corresponding degradation in the DAC capacitor array will result in a capacitor array with random variations at random points of time during the stress time 't'. In other words, a randomly increasing or decreasing degradation in each capacitor '$C_N$', as a result of aging, will result in a randomly degraded ADC, having a different DAC capacitor array, at random points of time during the stress time 't'. The randomly changing input voltage '$V_{IN}$' will decide the random ON or OFF activity of the switches and hence the random input stress voltage '$V_{STRESS}$' for each capacitor '$C_N$' during the stress time 't'. Similarly, the random working stress temperature '$T_{STRESS}$' along with the above stress conditions will decide the random increase or decrease in the capacitance of each capacitor '$C_N$' during the stress time 't'.
The random ON or OFF activity will also decide the random stress time during which the degradation occurs and hence will produce a random amount of degradation in each capacitor '$C_N$'. This can be modelled for each capacitor as a random degradation of its ideal value. Therefore, the value of each capacitor '$C_N$' in Figure 6.2 can be rewritten as:

$$C_{N-New} = \left[\frac{S \cdot (P \cdot R_1)}{100}\right] \cdot C_N + C_N \tag{6.10}$$

where

$$S = \begin{cases} 1 & \text{if } R_2 \ge 0.5 \\ -1 & \text{if } R_2 < 0.5 \end{cases} \tag{6.11}$$

Here 'P' is the maximum possible percent degradation value that can occur during the stress time 't' (e.g. 20 years) and can be obtained from experimental data. The parameters '$R_1$' and '$R_2$' are the outputs of two independent random number generators that generate random values between '0' and '1'. The parameter '$R_1$' can be related to the random amount of degradation in the capacitance of each capacitor '$C_N$' at random points of time during the stress time 't'. Similarly, the parameter '$R_2$' can be related to the randomly increasing and decreasing degradation behaviour in the capacitance of each capacitor '$C_N$' at random points of time during the stress time 't'. Therefore, '$C_H$' and '$C_L$' in equation (6.7) become:

$$C_{H-New} = \sum_{K} C_{K-New} \quad \text{for K such that } S_{K}^{1} \text{ is closed} \tag{6.12}$$

and

$$C_{L-New} = \sum_{K} C_{K-New} \quad \text{for K such that } S_{K}^{0} \text{ is closed} \tag{6.13}$$

Therefore, equation (6.7) can be rewritten as:

$$V_C = -(V_{IN} + V_{OS,BUFF} + V_{OS,COMP}) + \frac{C_{H-New}}{C_{H-New} + C_{L-New}} (V_{REF} + V_{OS,BUFF}) \tag{6.14}$$

It is clear from the above equation that the new capacitor ratio will also change the voltage generated during the redistribution phase. In other words, the capacitor degradation in the DAC capacitor array and the degradation in the buffer and comparator offset voltages, as a result of aging, will change the voltage value at the comparator input and hence the digital output of the SAR ADC could change.
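
The degradation model of equations (6.10) to (6.14) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the random numbers $R_1$ and $R_2$ are drawn from Python's standard generator, the dummy capacitor is degraded in the same way as the other capacitors, and the value P = 0.5 % as well as the offset values are example inputs. The function names are hypothetical and do not correspond to blocks of the LabVIEW model.

```python
import random


def degrade_capacitor(c_ideal, p_max, rng):
    """Equations (6.10)-(6.11): randomly degraded value of one capacitor.

    p_max is the maximum degradation P in percent.
    """
    r1 = rng.random()               # R_1: random amount of degradation
    r2 = rng.random()               # R_2: decides the direction of the change
    s = 1 if r2 >= 0.5 else -1      # equation (6.11)
    return (s * (p_max * r1) / 100.0) * c_ideal + c_ideal


def degraded_array(n_bits, p_max, seed=None):
    """Return (caps, dummy): a degraded binary-weighted array, LSB first."""
    rng = random.Random(seed)
    caps = [degrade_capacitor(float(2 ** k), p_max, rng) for k in range(n_bits)]
    dummy = degrade_capacitor(1.0, p_max, rng)
    return caps, dummy


def comparator_voltage(v_in, v_ref, c_high, c_low, v_os_buff=0.0, v_os_comp=0.0):
    """Equation (6.14): comparator input voltage with degraded components."""
    ratio = c_high / (c_high + c_low)
    return -(v_in + v_os_buff + v_os_comp) + ratio * (v_ref + v_os_buff)


caps, dummy = degraded_array(12, p_max=0.5, seed=1)
# First conversion step: only the largest capacitor is connected to V_REF.
c_high = caps[-1]
c_low = sum(caps[:-1]) + dummy
print(comparator_voltage(2.5, 5.0, c_high, c_low, v_os_buff=1e-3, v_os_comp=2e-3))
```

With P = 0 and zero offsets, `comparator_voltage` reduces to the ideal expression of equation (6.1).
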
These degradations in the DAC capacitor array as well as the buffer and the comparator offset voltages have been incorporated in the ideal models. The next two sections will discuss how the degradation as a result of aging will affect the static and dynamic performance parameters of the SAR ADC.

#### 6.3 SAR ADC PERFORMANCE ANALYSIS

The sine-wave histogram method is one of the most widely used methods for ADC testing [Hsi08]. It is normally used to obtain ADC *static* parameters like the offset ($V_{OS}$), gain (G), differential nonlinearity (DNL), and integral nonlinearity (INL). Some on-chip histogram test methods have also been reported in literature [Aza00]. This makes the sine-wave histogram-testing method a suitable choice for estimating the ADC static performance parameters. Similarly, a full-scale sine-wave input having a peak value of $2^{N-1}$ has also been used to estimate the N-bit ADC *dynamic* parameters like signal-to-noise and distortion (SINAD), total harmonic distortion (THD), and effective number of bits (ENOB). The following equations [Hsi08] have been used to calculate the static and dynamic parameters of the N-bit ADC in LabVIEW. First the equations for the *static* parameters are given. The offset is calculated using:

$$V_{OS} = \frac{\cos[\pi H(0)/N_t] - \cos[\pi H(2^N - 1)/N_t]}{\cos[\pi H(0)/N_t] + \cos[\pi H(2^N - 1)/N_t]} (2^{N-1} - 1) \tag{6.15}$$

Here $N_t$ denotes the total number of samples used in the histogram method whereas $H(0)$ and $H(2^N - 1)$ represent the number of hits at the lower and the upper codes respectively.
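
Equation (6.15) can be motivated by a short derivation. The sketch below assumes an ideal quantiser in which code $i$ covers the interval $[i, i+1)$, a sine wave of amplitude $A$ (in LSB) centred at $2^{N-1} + V_{OS}$ that slightly overdrives the ADC, and samples spread uniformly over the sine phase. The shorthands $M$, $c_0$ and $c_1$ are introduced here only for illustration.

$$
\begin{aligned}
M &= 2^{N-1} - 1 \\
\frac{H(0)}{N_t} &= \frac{1}{\pi}\left[\frac{\pi}{2} + \sin^{-1}\left(\frac{1 - 2^{N-1} - V_{OS}}{A}\right)\right]
\;\Rightarrow\; c_0 = \cos\left[\frac{\pi H(0)}{N_t}\right] = \frac{M + V_{OS}}{A} \\
\frac{H(2^N - 1)}{N_t} &= \frac{1}{\pi}\left[\frac{\pi}{2} - \sin^{-1}\left(\frac{2^{N-1} - 1 - V_{OS}}{A}\right)\right]
\;\Rightarrow\; c_1 = \cos\left[\frac{\pi H(2^N - 1)}{N_t}\right] = \frac{M - V_{OS}}{A} \\
\frac{c_0 - c_1}{c_0 + c_1}\, M &= \frac{2 V_{OS}}{2 M}\, M = V_{OS}
\end{aligned}
$$

Under the same assumptions, solving the expression for $c_1$ for $A$ gives $A = (M - V_{OS})/c_1$, so the amplitude of the applied sine wave can be recovered from the same two histogram bins.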
Similarly, the DNL error (DNLE) is calculated using:

$$DNLE(i) = \frac{H(i)}{H_{ideal}(i)} - 1, \qquad i = 1, 2, \dots, 2^{N} - 2 \tag{6.16}$$

where

$$H_{ideal}(i) = \frac{N_t}{\pi} \left[ \sin^{-1} \left( \frac{i + 1 - 2^{N-1} - V_{OS}}{A} \right) - \sin^{-1} \left( \frac{i - 2^{N-1} - V_{OS}}{A} \right) \right] \tag{6.17}$$

and

$$A = Amplitude = \frac{2^{N-1} - 1 - V_{OS}}{\cos[\pi H(2^N - 1)/N_t]} \tag{6.18}$$

The total number of samples used ($N_t$) is calculated by using the following equation, depending on the level of confidence used for the DNL error (DNLE) resolution [Atx13]:

$$N_t = \frac{\pi \, 2^{N-1} (Z_\alpha/2)^2}{\beta^2} \tag{6.19}$$

| Confidence Level | Value for $Z_{\alpha}/2$ | Confidence Level | Value for $Z_{\alpha}/2$ |
|------------------|--------------------------|------------------|--------------------------|
| 70% | 1.040 | 92% | 1.750 |
| 75% | 1.150 | 95% | 1.960 |
| 80% | 1.280 | 96% | 2.050 |
| 85% | 1.440 | 98% | 2.330 |
| 90% | 1.645 | 99% | 2.576 |

Table 6.1: Values of $Z_{\alpha}/2$ for commonly used confidence levels [Atx13].

Here '$Z_{\alpha}/2$' represents the confidence level, which has commonly accepted values as summarized in Table 6.1 [Atx13]. 'N' denotes the ADC number of bits and '$\beta$' the required DNLE resolution.
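
The histogram equations (6.15) to (6.19) can be checked numerically with a small sketch. The Python code below is an illustration only: it assumes an ideal quantiser in which code $i$ covers $[i, i+1)$, samples that are spread uniformly over one sine period, and example values (6 bits, amplitude 33 LSB, offset 0.3 LSB) that are not taken from this chapter. All function names are hypothetical.

```python
import math


def sample_count(n_bits, z, beta):
    """Equation (6.19): samples needed for a DNLE resolution beta."""
    return math.ceil(math.pi * 2 ** (n_bits - 1) * z ** 2 / beta ** 2)


def histogram_dnle(hist, n_bits):
    """Equations (6.15)-(6.18): offset, amplitude and DNLE from a histogram."""
    n_t = sum(hist)
    m = 2 ** (n_bits - 1) - 1
    c0 = math.cos(math.pi * hist[0] / n_t)
    c1 = math.cos(math.pi * hist[-1] / n_t)
    v_os = (c0 - c1) / (c0 + c1) * m                      # (6.15)
    amp = (m - v_os) / c1                                 # (6.18)
    mid = 2 ** (n_bits - 1)
    dnle = {}
    for i in range(1, 2 ** n_bits - 1):                   # inner codes only
        h_ideal = n_t / math.pi * (
            math.asin((i + 1 - mid - v_os) / amp)
            - math.asin((i - mid - v_os) / amp))          # (6.17)
        dnle[i] = hist[i] / h_ideal - 1                   # (6.16)
    return v_os, amp, dnle


def ideal_histogram(n_bits, amp, v_os, n_t):
    """Code histogram of an ideal quantiser driven by an overdriving sine."""
    top = 2 ** n_bits - 1
    hist = [0] * (top + 1)
    for k in range(n_t):
        phase = 2 * math.pi * (k + 0.5) / n_t
        x = 2 ** (n_bits - 1) + v_os + amp * math.sin(phase)
        hist[min(max(math.floor(x), 0), top)] += 1        # clip to valid codes
    return hist


print(sample_count(12, z=1.28, beta=0.1))
hist = ideal_histogram(6, amp=33.0, v_os=0.3, n_t=200_000)
v_os, amp, dnle = histogram_dnle(hist, 6)
print(round(v_os, 2), round(amp, 2), max(abs(d) for d in dnle.values()))
```

For a 12-bit ADC with $Z_{\alpha}/2$ = 1.28 (80 % confidence) and $\beta$ = 0.1, `sample_count` returns 1054144, which coincides with the number of samples listed in Table 6.2. Whether these were the settings actually used is not stated and is an assumption here.
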
Furthermore, the INL error (INLE) and gain (G) are calculated using the following equations [Hsi08]:

$$INLE(i) = \sum_{k=1}^{i} DNLE(k), \qquad i = 1, 2, \dots, 2^{N} - 2 \tag{6.20}$$

$$G \approx 1 - \frac{1}{2^N - 2} \sum_{i=2}^{2^N - 2} DNLE(i) \tag{6.21}$$

The following equations have been used to calculate the *dynamic* parameters [Gra06], being signal-to-noise and distortion (SINAD), total harmonic distortion (THD), and effective number of bits (ENOB):

$$SNR = 20 \log_{10} \left[ \frac{\text{RMS value of FS sine wave}}{\text{RMS value of quantization noise}} \right] \tag{6.22}$$

$$THD = 20 \log_{10} \sqrt{\frac{V_{f2}^2 + V_{f3}^2 + \dots + V_{fn}^2}{V_{f1}^2}} \tag{6.23}$$

$$SINAD = -10 \log_{10} \left[ 10^{\frac{-SNR}{10}} + 10^{\frac{-THD}{10}} \right] \tag{6.24}$$

$$ENOB = [SINAD - 1.76] / 6.02 \tag{6.25}$$

where 'FS' denotes the full scale, '$V_{f1}$' is the amplitude of the fundamental frequency, '$V_{f2}$' is the amplitude of the second harmonic, etc. In equation (6.24), SNR and THD are both entered as positive magnitudes in dB.

#### 6.4 SIMULATION SETUP

In order to investigate the degradation in the ADC performance parameters (static and dynamic) as a result of the degradation in its chosen sub-building blocks, a performance-analysis simulation setup has been constructed in the LabVIEW environment as shown in Figure 6.3.

Figure 6.3: Block diagram of the SAR ADC performance-analysis system modelled in the LabVIEW environment.

This performance-analysis system consists of a flexible sine-wave generator where amplitude, frequency, dc value, duration and number of samples produced can be changed by the user. The number of samples generated depends on the ADC number of bits, the DNLE resolution and the confidence level used, as shown in equation (6.19). For the confidence level '$Z_{\alpha}/2$' the commonly accepted values used are summarized in Table 6.1. The ADC is modelled in the LabVIEW environment based on the equations (6.1) - (6.14).
Separate control structures are used to provide the desired reference voltage $(V_{REF})$ and possible variations in the buffer and the comparator offset voltages $(V_{OS,BUFF}, V_{OS,COMP})$, as shown in Figure 6.3. The modelled ADC has a flexible architecture in which the input voltage, the reference voltage and the number of bits of the digital output $(D_{OUT})$ can be altered. This also holds for the maximum percentage of error of the DAC capacitors and for the confidence level used for the DNLE resolution. The output of the ADC is stored in a memory, which is then used to calculate the static and the dynamic performance parameters using equations (6.15) - (6.25), as discussed in the previous section.

For calculating the static parameters with the histogram-based method explained in the previous section, the amplitude of the input sine wave has been taken slightly larger than the full-scale voltage (Amplitude > FS Voltage). For the dynamic parameters, on the other hand, the amplitude of the input sine wave has been taken equal to the full-scale voltage (Amplitude = FS Voltage) of the ADC. Table 6.2 further summarizes the values that have been used for analysing the static and the dynamic performance parameters of the ADC. A resolution of twelve bits, lying in between high- and low-resolution ADCs, has been used for simulating the ADC performance. Similarly, actual degraded values of the buffer and the comparator offset at different stress points in time can be used, as discussed in [Wan11].

| Parameter | Values for Static Parameters | Values for Dynamic Parameters |
|-------------------------------|------------------------------|-------------------------------|
| ADC No. of Bits | 12 | 12 |
| Amplitude | 2.54 [V] | 2.50 [V] |
| Frequency | 100 [kHz] | 100 [kHz] |
| DC Value | 2.50 [V] | 2.50 [V] |
| Duration | 1 [s] | 1 [s] |
| Samples | 1054144 | 1054144 |
| Capacitor Maximum Degradation | ±0.5% | ±0.5% |

Table 6.2: Values used for determining the static and dynamic performance parameters in the SAR ADC performance-analysis system shown in Figure 6.3.

Table 6.3: Degraded values of each capacitor of twenty randomly generated DAC capacitor arrays for analysing the static (upper 10 values) and dynamic (lower 10 values) performance parameters of the SAR ADC ($\times C_0$ = multiples of $C_0$).

| Capacitor values ($\times C_0$) | C<sub>12</sub> | C<sub>11</sub> | C<sub>10</sub> | C<sub>9</sub> | C<sub>8</sub> | C<sub>7</sub> | C<sub>6</sub> | C<sub>5</sub> | C<sub>4</sub> | C<sub>3</sub> | C<sub>2</sub> | C<sub>1</sub> | C<sub>0</sub> |
|--------|-----------|-----------|----------|----------|----------|---------|---------|---------|--------|--------|--------|--------|--------|
| ADC 1 | 2044.5773 | 1026.1637 | 514.4560 | 256.5890 | 127.9409 | 64.2214 | 32.0391 | 16.0180 | 7.9928 | 4.0189 | 2.0064 | 1.0003 | 1.0024 |
| ADC 2 | 2043.4929 | 1019.1582 | 513.8459 | 255.2488 | 127.7026 | 64.1403 | 32.1291 | 15.9847 | 7.9623 | 3.9957 | 1.9922 | 0.9964 | 0.9968 |
| ADC 3 | 2040.3306 | 1022.3928 | 509.4785 | 256.1279 | 128.2427 | 63.9777 | 31.8510 | 15.9628 | 7.9653 | 3.9824 | 1.9905 | 1.0007 | 1.0032 |
| ADC 4 | 2048.9660 | 1023.5137 | 509.5469 | 255.8567 | 127.4749 | 63.9945 | 31.9708 | 15.9526 | 8.0270 | 4.0106 | 1.9967 | 1.0037 | 1.0040 |
| ADC 5 | 2051.6210 | 1019.7001 | 510.7234 | 256.8319 | 127.5120 | 63.9720 | 32.0973 | 16.0159 | 8.0293 | 3.9867 | 1.9933 | 0.9963 | 0.9984 |
| ADC 6 | 2046.3124 | 1020.3507 | 509.9676 | 257.2222 | 128.0721 | 64.3058 | 32.0080 | 15.9273 | 7.9714 | 4.0163 | 2.0047 | 0.9969 | 1.0005 |
| ADC 7 | 2050.3543 | 1027.2532 | 511.6430 | 255.0103 | 127.4528 | 63.6988 | 31.9102 | 15.9693 | 7.9796 | 4.0052 | 1.9970 | 1.0018 | 1.0004 |
| ADC 8 | 2052.5054 | 1025.6668 | 510.7208 | 255.7884 | 127.7577 | 63.9915 | 31.9982 | 16.0757 | 8.0163 | 4.0156 | 2.0089 | 0.9956 | 1.0015 |
| ADC 9 | 2043.7981 | 1020.1156 | 513.3331 | 256.9407 | 128.0878 | 64.1723 | 31.9197 | 15.9957 | 7.9874 | 4.0005 | 1.9989 | 0.9967 | 1.0011 |
| ADC 10 | 2044.5828 | 1026.8325 | 510.9647 | 255.4617 | 128.4316 | 64.1378 | 32.1179 | 15.9434 | 7.9899 | 3.9859 | 2.0036 | 0.9995 | 0.9985 |
| Ideal | 2048.0000 | 1024.0000 | 512.0000 | 256.0000 | 128.0000 | 64.0000 | 32.0000 | 16.0000 | 8.0000 | 4.0000 | 2.0000 | 1.0000 | 1.0000 |
| ADC 1 | 2058.0785 | 1025.0240 | 512.4490 | 256.0151 | 128.4292 | 64.1949 | 32.0021 | 15.9300 | 8.0223 | 4.0039 | 1.9934 | 1.0028 | 0.9988 |
| ADC 2 | 2054.7004 | 1025.2771 | 512.7750 | 256.8353 | 127.5917 | 63.9159 | 32.1061 | 15.9734 | 7.9810 | 4.0193 | 1.9920 | 0.9955 | 1.0010 |
| ADC 3 | 2043.0979 | 1026.3997 | 510.9438 | 255.1742 | 127.5556 | 64.2060 | 31.9583 | 16.0302 | 8.0220 | 3.9971 | 2.0039 | 1.0015 | 1.0021 |
| ADC 4 | 2041.7161 | 1020.4014 | 509.8172 | 256.0988 | 128.5292 | 63.8461 | 31.9902 | 16.0640 | 8.0232 | 4.0033 | 1.9991 | 1.0025 | 0.9967 |
| ADC 5 | 2038.9944 | 1027.8513 | 511.7754 | 256.2692 | 128.1075 | 64.1867 | 32.1047 | 16.0026 | 8.0199 | 4.0044 | 2.0025 | 0.9963 | 0.9998 |
| ADC 6 | 2044.9683 | 1028.2846 | 510.6185 | 256.7972 | 128.6365 | 64.0923 | 31.9056 | 16.0086 | 8.0198 | 4.0122 | 1.9998 | 1.0011 | 0.9956 |
| ADC 7 | 2047.2206 | 1021.0516 | 513.7262 | 255.3050 | 127.3717 | 64.1836 | 32.0583 | 15.9871 | 8.0283 | 4.0093 | 2.0083 | 1.0030 | 0.9995 |
| ADC 8 | 2054.4614 | 1029.1170 | 510.1475 | 254.7552 | 128.3615 | 63.7708 | 31.8833 | 15.9537 | 8.0271 | 3.9831 | 1.9901 | 0.9961 | 0.9983 |
| ADC 9 | 2048.7245 | 1027.1724 | 514.2769 | 256.5127 | 128.4187 | 64.2682 | 31.9778 | 16.0398 | 7.9739 | 3.9815 | 2.0014 | 1.0022 | 0.9994 |
| ADC 10 | 2042.7861 | 1025.4958 | 509.5706 | 256.6056 | 127.5253 | 63.9892 | 31.9832 | 15.9612 | 7.9745 | 4.0173 | 2.0070 | 0.9974 | 0.995 |
In the present simulation setup, values in the range of $\pm 10\,mV$ have been used for the change in the buffer and the comparator offset voltages as a result of aging effects. Based on equations (6.10) and (6.11), twenty randomly degraded DAC capacitor arrays have been generated. Most importantly, the following assumption has been made while generating these random DAC capacitor arrays.

#### **Assumption:**

It is assumed that the maximum change or degradation that can occur in each capacitor $C_{N}$ during the total stress time $t$ (e.g. 20 years) lies within $\pm 0.5\%$ of its ideal value $2^{N-1}C$. This means that the parameter $P$ in equation (6.10) has been chosen to be $\pm 0.5\%$. This could be an exaggerated value, but the idea is to test the system under a worst-case scenario.

Table 6.4: 12-bit SAR ADC output offset voltage values extracted from Figure 6.4(a).

| V<sub>OS_ADC</sub> [mV] | $V_{OS\_COMP} = -10 \ mV$ | $V_{OS\_COMP} = 0 \ mV$ | $V_{OS\_COMP} = 10 \ mV$ |
|---------------------------|---------------------------|-------------------------|--------------------------|
| $V_{OS\_BUFF} = -10 \ mV$ | -15.03 | -5.01 | 5.01 |
| $V_{OS\_BUFF} = 0 \ mV$ | -9.99 | 0.00 | 9.99 |
| $V_{OS\_BUFF} = 10 \ mV$ | -4.99 | 4.99 | 14.97 |

Twenty randomly degraded DAC capacitor arrays based on the above assumption have been used in the SAR ADC simulation setup for analysing the static and dynamic performance parameters. The upper ten arrays in Table 6.3 have been used for analysing the static performance parameters, whereas the lower ten arrays in Table 6.3 have been used for analysing the dynamic performance parameters. Separate arrays are used for the static and the dynamic parameters because the values are drawn randomly during each simulation run, which makes it hard to reproduce the same array in both analyses.
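The generation of the degraded capacitor arrays under the above assumption can be sketched in a few lines of Python. This is only an illustration and not the LabVIEW implementation of equations (6.10) and (6.11): the uniform distribution, the function names `ideal_array` and `degraded_array`, and the use of seeds are assumptions.

```python
import random


def ideal_array(n_bits: int) -> list[float]:
    """Ideal binary-weighted values [C_N, ..., C_1, C_0] in multiples of C0.

    C_k = 2^(k-1) for k >= 1, plus one extra unit capacitor C_0, which
    matches the 'Ideal' row of Table 6.3 for n_bits = 12.
    """
    return [float(2 ** (k - 1)) for k in range(n_bits, 0, -1)] + [1.0]


def degraded_array(n_bits: int, p: float = 0.005, rng=None) -> list[float]:
    """Return one randomly degraded array, each value within +/- p of ideal.

    A uniform deviation is assumed here; the source only states the bound.
    """
    if rng is None:
        rng = random.Random()
    return [c * (1.0 + rng.uniform(-p, p)) for c in ideal_array(n_bits)]


# Twenty arrays, as in the text; a fixed seed per array makes each one
# reproducible across separate simulation runs.
arrays = [degraded_array(12, rng=random.Random(seed)) for seed in range(20)]
print([round(c, 4) for c in arrays[0]])
```

Seeding the generator is one possible way around the reproducibility issue mentioned above, since the same seed returns the same array for both the static and the dynamic analysis.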
Simulations have been conducted for each of these ADCs according to the values given in Tables 6.2 and 6.3. The results, based on the output of each randomly degraded ADC, are used to calculate the static and dynamic performance parameters and are discussed in the next section.

#### 6.5 SIMULATION RESULTS

The obtained results are presented separately for the static and the dynamic parameters, as discussed below.

#### 6.5.1 STATIC PERFORMANCE PARAMETER DEGRADATION RESULTS

Four static parameters, being the offset, gain, DNLE and INLE of the SAR ADC, are considered and discussed here.

#### 6.5.1.1 THE SAR ADC OUTPUT OFFSET VOLTAGE DEGRADATION

Figure 6.4(a) shows the degradation in the output offset voltage of the 12-bit ADC, having an ideal DAC capacitor array, as a result of degradation in the buffer and comparator offset voltages. It reveals a linear relationship between the buffer and comparator offset voltages and the ADC output offset voltage when the buffer and comparator offset voltages degrade linearly. However, a nonlinear relation could be seen in the case of nonlinear degradation effects in the buffer and comparator offset voltages. An interesting point to note here is that a $1\,mV$ change in the buffer offset results in a $0.5\,mV$ change in the output offset voltage, whereas a $1\,mV$ change in the comparator offset results in a $1\,mV$ change in the output offset voltage, as shown in Table 6.4 (3<sup>rd</sup> column and 3<sup>rd</sup> row). Figure 6.4(a) indicates that the output offset voltage of the SAR ADC is strongly affected by the degraded buffer and comparator offset values. Figure 6.4(b) depicts the output offset voltage standard deviation of ten randomly degraded ADCs, having degraded DAC capacitor array values (Table 6.3), as a function of the buffer and comparator offset voltage degradations.
This figure indicates that the capacitor degradations caused by aging effects have an almost negligible effect on the output offset voltage of the SAR ADC (Figure 6.4(b)). The vertical colour bar in these figures shows the numerical values corresponding to each colour.

Figure 6.4: (a) Output offset voltage degradation of a single SAR ADC with ideal DAC capacitor array and (b) output offset standard deviation of ten 12-bit SAR ADCs with random DAC capacitor-array degradation as a result of the degradation in the buffer offset voltage $(-10mV \le V_{OS,BUFF} \le 10mV)$ and the comparator offset voltage $(-10mV \le V_{OS,COMP} \le 10mV)$ respectively.

#### 6.5.1.2 THE SAR ADC GAIN DEGRADATION

Figure 6.5: The 'gain' parameter degradation of a single SAR ADC with ideal DAC capacitor array as a result of the degradation in the buffer offset voltage $(-10mV \le V_{OS,BUFF} \le 10mV)$ and the comparator offset voltage $(-10mV \le V_{OS,COMP} \le 10mV)$.

Figure 6.5 shows the change in the 'gain' parameter of the 12-bit ADC having an ideal DAC capacitor array. It indicates that the degradation in the buffer and the comparator offset voltages has almost no effect on the 'gain' parameter. Similarly, Figure 6.6(a) shows the 'gain' parameter variation of each of the ten randomly degraded ADCs (Table 6.3) as a function of the buffer and comparator offset voltage degradations. Figure 6.6(b) shows the corresponding standard deviation in the 'gain' parameter. The nearly flat surface of each randomly degraded ADC in Figure 6.6(a) indicates that, contrary to the output offset voltage (previous section), the 'gain' parameter of each randomly degraded ADC is hardly affected by the buffer and comparator offset voltage degradations. However, the standard deviation of the 'gain' parameter for these ten ADCs in Figure 6.6(b) shows that it has changed, though only slightly, as a result of capacitor degradations in the DAC capacitor array.
This change in the 'gain' parameter of the ADC turns out to be proportional to the percentage of degradation in the capacitor values due to the aging effects.

Figure 6.6: (a) 'Gain' parameter degradation surface, and (b) 'gain' parameter standard deviation of ten 12-bit SAR ADCs with random DAC capacitor-array degradation as a result of the degradation in the buffer offset voltage $(-10mV \le V_{OS,BUFF} \le 10mV)$ and the comparator offset voltage $(-10mV \le V_{OS,COMP} \le 10mV)$ respectively.

#### 6.5.1.3 THE SAR ADC DNLE AND INLE DEGRADATION

Figures 6.7(a) and 6.7(b) show the DNL error (DNLE) and INL error (INLE) standard deviation values of the 12-bit ADC having an ideal DAC capacitor array. From these figures it can be observed that the degradation in the buffer and the comparator offset voltages produces only very small DNLE and INLE values for a 12-bit ADC. Figures 6.8(a) and 6.8(b) depict the DNLE and INLE standard deviation of each of the ten randomly degraded ADCs as a function of the buffer and comparator offset voltage degradations. The nearly flat DNLE and INLE surface of each of the ten randomly degraded ADCs in Figures 6.8(a) and 6.8(b) gives rise to the conclusion that the degradation in the buffer and comparator offset voltages has a negligible effect on the DNLE and INLE values. However, the random capacitor degradation in the DAC capacitor array will lead to severe effects in the DNLE and INLE values (the vertical surface shift in Figures 6.8(a) and 6.8(b)). The vertical colour bar in these figures shows the numerical values corresponding to each colour.

Figure 6.7: (a) Output differential nonlinearity error (DNLE) and (b) output integral nonlinearity error (INLE) standard deviation of a single SAR ADC with ideal DAC capacitor array as a result of the degradation in the buffer offset voltage $(-10mV \le V_{OS,BUFF} \le 10mV)$ and the comparator offset voltage $(-10mV \le V_{OS,COMP} \le 10mV)$ respectively.

Figure 6.8: (a) DNLE standard deviation surface and (b) INLE standard deviation surface of ten 12-bit SAR ADCs with random DAC capacitor-array degradation as a result of the degradation in the buffer offset voltage $(-10mV \le V_{OS,BUFF} \le 10mV)$ and the comparator offset voltage $(-10mV \le V_{OS,COMP} \le 10mV)$ respectively.

#### 6.5.2 DYNAMIC PERFORMANCE PARAMETER RESULTS

Three dynamic performance parameters, being the signal-to-noise and distortion (SINAD), total harmonic distortion (THD), and effective number of bits (ENOB), are considered and discussed here.

#### 6.5.2.1 SAR ADC SINAD, THD AND ENOB DEGRADATION

Figure 6.9 shows the degradation in SINAD of the 12-bit ADC, having an ideal DAC capacitor array, as a result of the degradation in the buffer and the comparator offset voltages. This shows that the SINAD is more sensitive to the buffer offset voltage degradation than to the comparator offset voltage degradation. Figures 6.10(a) and 6.10(b) show the change in SINAD of each of the ten randomly degraded ADCs (Table 6.3) and the combined standard deviation in SINAD of all ten randomly degraded ADCs respectively. This shows that the degradation in SINAD is also highly sensitive to the DAC capacitor-array degradations.

Figure 6.9: Degradation of the output signal-to-noise and distortion (SINAD) of a single 12-bit SAR ADC with ideal DAC capacitor array as a result of the degradation in the buffer offset voltage $(-10mV \le V_{OS,BUFF} \le 10mV)$ and the comparator offset voltage $(-10mV \le V_{OS,COMP} \le 10mV)$.

Figure 6.10: Degradation in (a) SINAD of each of ten 12-bit SAR ADCs and (b) the combined standard deviation in SINAD of all ten 12-bit SAR ADCs with random DAC capacitor-array degradation as a result of the degradation in the buffer offset voltage $(-10mV \le V_{OS,BUFF} \le 10mV)$ and the comparator offset voltage $(-10mV \le V_{OS,COMP} \le 10mV)$ respectively.

Figure 6.11 shows the degradation in THD of the 12-bit ADC, having an ideal DAC capacitor array, as a result of the degradation in the buffer and the comparator offset voltages. This shows that, similar to the SINAD degradation, the THD degradation is also more sensitive to the buffer offset voltage degradation than to the comparator offset voltage degradation. Figures 6.12(a) and 6.12(b) show the change in THD of each of the ten randomly degraded ADCs (Table 6.3) and the combined standard deviation in THD of all ten randomly degraded ADCs respectively. This shows that the degradation in THD is also highly sensitive to the DAC capacitor-array degradations.

Figure 6.11: Degradation of the output total harmonic distortion (THD) of a single 12-bit SAR ADC with ideal DAC capacitor array as a result of the degradation in the buffer offset voltage $(-10mV \le V_{OS,BUFF} \le 10mV)$ and the comparator offset voltage $(-10mV \le V_{OS,COMP} \le 10mV)$.

Figure 6.12: Degradation in (a) THD of each of ten 12-bit SAR ADCs and (b) the combined standard deviation in THD of all ten 12-bit SAR ADCs with random DAC capacitor-array degradation as a result of the degradation in the buffer offset voltage $(-10mV \le V_{OS,BUFF} \le 10mV)$ and the comparator offset voltage $(-10mV \le V_{OS,COMP} \le 10mV)$ respectively.

Figure 6.13: Degradation of the effective number of bits (ENOB) of a single 12-bit SAR ADC with ideal DAC capacitor array as a result of the degradation in the buffer offset voltage $(-10mV \le V_{OS,BUFF} \le 10mV)$ and the comparator offset voltage $(-10mV \le V_{OS,COMP} \le 10mV)$.

Figure 6.13 shows the degradation in ENOB of the 12-bit ADC, having an ideal DAC capacitor array, as a result of the degradation in the buffer and the comparator offset voltages. This shows that, similar to the THD and SINAD degradation, the ENOB degradation is also more sensitive to the buffer offset voltage degradation than to the comparator offset voltage degradation.
Figures 6.14(a) and 6.14(b) show the change in ENOB of each of the ten randomly degraded ADCs (Table 6.3) and the combined standard deviation in ENOB of all ten randomly degraded ADCs respectively. This reveals that the degradation in ENOB is also highly sensitive to DAC capacitor-array degradations.

Figure 6.14: Degradation in (a) ENOB of each of ten 12-bit SAR ADCs and (b) the combined standard deviation in ENOB of all ten 12-bit SAR ADCs with random DAC capacitor-array degradation as a result of the degradation in the buffer offset voltage $(-10mV \le V_{OS,BUFF} \le 10mV)$ and the comparator offset voltage $(-10mV \le V_{OS,COMP} \le 10mV)$ respectively.

The above results lead to the conclusion that the SINAD, THD, and ENOB become worse as the buffer and the comparator offset voltages move towards the -10 mV value. This is because, according to equation (6.7), the comparator offset voltage will try to cancel out the buffer offset voltage. However, the DAC capacitor array will introduce some part of the buffer offset voltage at node C (Figure 6.1). This could saturate the comparator in one direction and then introduce distortion in the actual value. This distortion will in turn degrade the SINAD, THD and ENOB. Some of the values extracted from the above figures are summarized in Table 6.5, which shows that the best values of SINAD, THD, and ENOB are achieved in the triangular region formed by the highlighted dotted-red boxes.

Table 6.5: SINAD, THD, and ENOB values extracted from Figures 6.9, 6.11 and 6.13.

| Buffer offset | Parameter | $V_{OS\_COMP} = -10 \ mV$ | $V_{OS\_COMP} = 0 \ mV$ | $V_{OS\_COMP} = 10 \ mV$ |
|---------------------------|------------|---------|---------|--------|
| $V_{OS\_BUFF} = -10 \ mV$ | SINAD [dB] | 55.68 | 62.52 | 62.52 |
| | THD [dB] | -61.11 | -69.68 | -69.68 |
| | ENOB [bit] | 8.957 | 10.09 | 10.09 |
| $V_{OS\_BUFF} = 0 \ mV$ | SINAD [dB] | 62.51 | 73.98 | 62.51 |
| | THD [dB] | -69.70 | -104.50 | -69.70 |
| | ENOB [bit] | 10.09 | 12.00 | 10.09 |
| $V_{OS\_BUFF} = 10 \ mV$ | SINAD [dB] | 73.98 | 73.98 | 62.51 |
| | THD [dB] | -107.30 | -107.30 | -69.67 |
| | ENOB [bit] | 12.00 | 12.00 | 10.09 |

The combined standard deviation in SINAD, THD and ENOB (Figures 6.10(b), 6.12(b), and 6.14(b) respectively) of all ten ADCs shows the total degradation that can be expected in 20 years under the assumption made in section 6.4. These figures show that the dynamic parameters are severely affected by the degradation in the buffer and comparator offset voltages as well as by the random degradation in the capacitors of the DAC capacitor array.

#### 6.5.3 SUMMARY OF SIMULATION RESULTS

Summarising the above results, one can say that among the static parameters the offset, and among the dynamic parameters the SINAD, THD and ENOB, are severely affected by the buffer and comparator offset degradations. In contrast, the static parameters gain, DNLE, and INLE are minimally affected by the buffer and comparator offset degradations. On the other hand, all of the static parameters except the offset voltage, and all of the dynamic performance parameters of the ADC, are severely affected by the capacitor degradations in the DAC capacitor array. These results have been summarized in Table 6.6.
| Degradation Effect | Offset | Gain | DNLE | INLE | SINAD | THD | ENOB |
|-------------------------------|------------|------------|------------|------------|-----------|-----------|-----------|
| Buffer Offset Degradation | Huge | Very Small | Very Small | Very Small | Very Huge | Very Huge | Very Huge |
| Comparator Offset Degradation | Very Huge | Very Small | Very Small | Very Small | Very Huge | Very Huge | Very Huge |
| DAC Capacitor Degradation | Very Small | Small | Very Huge | Very Huge | Very Huge | Very Huge | Very Huge |

Table 6.6: The effect of the buffer and comparator offset voltage and DAC capacitor-array degradations on the static and dynamic performance parameters of the SAR ADC.

#### 6.6 POTENTIAL CRITICAL PERFORMANCE PARAMETERS

As discussed in the previous chapters, system-level performance parameters can possibly be used to estimate the runtime (during the operational life) reliability of a system. For this, it first has to be established which system-level performance parameters are critical to aging effects and how many system-level performance parameters are required to estimate the runtime reliability of the system. In the present case of a SAR ADC, some results about the degradation effects in its system-level performance parameters have already been obtained, as discussed in section 6.5.3. These results show which system-level performance parameters (e.g. output offset voltage) are critical to aging effects and which parameters are responsible for their degradation. From Table 6.6 one can therefore draw the following conclusions:

1) The buffer and comparator offset voltage degradation and the DAC capacitor-array degradation are the two main contributors (assumed in this chapter for simplicity) in degrading the static and dynamic performance parameters of the SAR ADC.
2) No single system-level performance parameter (static or dynamic) can be considered as the best reliability indicator for the rest of the system-level performance parameters (explained below).
3) The buffer and comparator offset voltage degradation and the DAC capacitor-array degradation show opposite results for the 'gain', DNLE and INLE parameters (Table 6.6). The buffer and comparator offset voltages have almost no effect on the 'gain', DNLE and INLE parameters, whereas the DAC capacitor-array degradation does have an impact on them.
4) If the degradation in the output offset voltage value (2<sup>nd</sup> column in Table 6.6), as a result of buffer and comparator offset degradation, can somehow be controlled, then the corresponding degradation in the dynamic performance parameters can also be controlled (scenario 1, explained below).
5) If the degradation in the INLE value (5<sup>th</sup> column in Table 6.6), as a result of DAC capacitor-array degradation, can somehow be controlled, then the corresponding degradation in the gain parameter, the DNLE, and all of the dynamic performance parameters can also be controlled (scenario 2, explained below).
6) Furthermore, all of the dynamic performance parameters are affected by both the buffer and comparator offset voltage degradation and the DAC capacitor-array degradation (Table 6.6).

Based on the above conclusions, two system-level performance parameters, being the SAR ADC output offset voltage and the INLE value, can be selected as the best indicators for degradation effects. The maximum or minimum DNLE value can equally be selected as an indicator in place of the INLE value.
The following scenarios can be extracted from the above discussion:

**Scenario 1:** If the output offset value is outside its specification boundaries/limits whereas the INLE value is within its specification boundaries/limits, the buffer and comparator offset voltage degradation is responsible for the degradation in the system-level performance parameters (Table 6.6). Adjusting the buffer and comparator offset voltages back to within their normal specification boundaries is then a possible solution to avoid degradation of the static and dynamic performance parameters.

**Scenario 2:** If the output offset voltage is within its specification boundaries/limits and the INLE value is outside its specification boundaries/limits (Table 6.6), the DAC capacitor-array degradation is responsible for the degradation in the system-level performance parameters. Adjusting the DAC capacitor array values (explained in the next section) back to their normal specification boundaries/limits could then possibly be used to alleviate the degradation effects in the static and dynamic performance parameters.

**Scenario 3:** If both the output offset voltage and the INLE value are outside their specification boundaries (Table 6.6), then both the buffer and comparator offset voltage degradation and the DAC capacitor-array degradation are responsible for the degradation in the system-level parameters. In this case, both the buffer and comparator offset voltages and the DAC capacitor array values have to be adjusted back (explained in the next section) to their normal specification boundaries/limits in order to avoid the corresponding degradation effects in the static and dynamic performance parameters.
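The three scenarios amount to a small decision rule on the two monitored indicators. The following Python sketch illustrates that rule only; it is not taken from the thesis. The names `Diagnosis` and `diagnose`, and the default limits of 5 mV and 1 LSB, are hypothetical placeholders, since the source does not give numerical specification limits.

```python
from enum import Enum


class Diagnosis(Enum):
    WITHIN_SPEC = "no corrective action needed"
    SCENARIO_1 = "adjust buffer and comparator offset"
    SCENARIO_2 = "adjust DAC capacitor array"
    SCENARIO_3 = "adjust both offset and DAC capacitor array"


def diagnose(offset_mv: float, inle_lsb: float,
             offset_limit_mv: float = 5.0,    # hypothetical limit
             inle_limit_lsb: float = 1.0      # hypothetical limit
             ) -> Diagnosis:
    """Map the two monitored indicators onto Scenarios 1 to 3."""
    offset_out = abs(offset_mv) > offset_limit_mv
    inle_out = abs(inle_lsb) > inle_limit_lsb
    if offset_out and inle_out:
        return Diagnosis.SCENARIO_3
    if offset_out:
        return Diagnosis.SCENARIO_1
    if inle_out:
        return Diagnosis.SCENARIO_2
    return Diagnosis.WITHIN_SPEC


# Example: output offset of 9.99 mV (a value from Table 6.4), small INLE.
print(diagnose(9.99, 0.2).value)
```

In a real monitor the limits would come from the ADC specification, and the measured INLE value would typically be the maximum absolute INLE over all codes.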
In conclusion, by introducing a monitoring and controlling mechanism for the two performance parameters, the output offset voltage and the INLE value, the rest of the static and dynamic performance parameters can be maintained within their specification boundaries/limits (according to Table 6.6). The next section discusses some possible ways to monitor and control these performance parameters, as well as additional improvements to enhance the overall dependability of the SAR ADC.

#### 6.7 PROPOSED DEPENDABILITY ENHANCEMENT STRATEGIES

Enhancing the dependability of the SAR ADC means enhancing its individual attributes. The focus of this thesis has been on some important dependability attributes, namely the reliability, the maintainability and the availability of the system. This section investigates possible strategies to improve the dependability of SAR ADCs with respect to these attributes.

According to the results and discussions presented in the previous section, separate monitoring and control mechanisms for the SAR ADC output offset voltage and INLE values are required. These performance parameters can be controlled by controlling the respective degradation-causing parameter. In our case this means keeping the buffer and comparator offset voltages and the DAC capacitor values within their specified values. A number of possibilities to monitor and control these performance parameters are discussed below.

#### 6.7.1 MONITORING MECHANISMS

Monitoring the SAR ADC output offset voltage is different from monitoring a simple offset voltage in an analog circuit, because digital values instead of analog values are present at the output of the SAR ADC. Therefore, measuring the offset voltage at the output of a SAR ADC requires a reference input signal with known values, from which the output offset voltage can be estimated.
Similarly, INLE values require a special method to be monitored during the operational life of the SAR ADC. Combining these two monitoring requirements, the on-chip sine-wave histogram method [Lee04, Aza01] seems to be a suitable candidate for monitoring both the SAR ADC output offset voltage and the INLE values.

#### 6.7.2 CONTROLLING MECHANISMS

Controlling the output offset and INLE degradation in the SAR ADC is more complicated than monitoring them. As mentioned in the previous sections, the degradation in the buffer and comparator offset is the major reason for degradation in the SAR ADC output offset, and the DAC capacitor-array degradation is the major reason for INLE degradation. Therefore, controlling the SAR ADC output offset voltage and INLE value means controlling the buffer and comparator offset voltage and the DAC capacitor-array values.

#### 6.7.2.1 CONTROLLING THE BUFFER AND COMPARATOR OFFSET

Controlling the buffer and comparator offset can be accomplished in the following ways.

#### B. THE OFFSET CANCELLATION TECHNIQUE

One way of dealing with the buffer and comparator offset voltage is the use of offset cancellation techniques [Wu13]. Usually, there are three types of CMOS offset cancellation techniques, being trimming, chopping, and auto-zeroing. These in turn can be divided into two categories: static and dynamic offset cancellation techniques. Trimming, a static offset cancellation technique, is performed once during the production phase to eliminate offset [Wu13]. Auto-zeroing, a dynamic offset cancellation technique, is a sampling technique in which the offset is measured and then subtracted in the subsequent clock phases. Chopping is another dynamic cancellation technique; it is a continuous-time modulation technique in which the signal and the offset are modulated to different frequencies [Wu13].
Since chopping and auto-zeroing are dynamic techniques that continuously reduce offset, they also remove low-frequency 1/f noise as well as offset drift over temperature or time. Therefore, among these offset cancellation techniques, auto-zeroing, being a sampling and dynamic technique, can be used as a potential way to reduce the buffer and comparator offset degradations and hence to control the SAR ADC output offset voltage.

#### C. DIGITAL TUNING TECHNIQUES FOR OFFSET

Besides these automatic offset cancellation techniques, a digital offset compensation technique [Wan11] can also be used. In this case, digitally controlled knobs are used to externally control the internal offset. Each digital step corresponds to a specific offset voltage compensation (e.g. $9\,\mu V$). This type of digital tuning technique is not automatic. It first requires a monitoring methodology to estimate the amount of offset present, after which a corresponding digital tuning value can be selected to compensate that offset voltage.

#### 6.7.2.2 CONTROLLING THE DAC CAPACITOR ARRAY VALUES

Keeping the capacitances of the DAC capacitor array within the specification boundaries/limits is more challenging than controlling the buffer and comparator offset voltage. Traditionally, a unit-element topology and careful layout are often proposed for DAC capacitor improvements. However, these improvement techniques are static and can only be used at either the design or the production phase. Maintaining the time-dependent values of the DAC capacitor array, or reducing the time-dependent degradation in its capacitors, requires a dynamic approach that can be applied at different points in time during the lifetime of the SAR ADC. A possible solution that meets the dynamic requirements is the bitwise correlation (BWC) technique reported in [Liu12].
The authors inject a discrete-time single-bit pseudorandom noise (PN) signal into the ADC input, which is converted along with the analog input signal. The resulting output is used by the associated digital-domain calibration engine to extract the bit weights simultaneously, and these weights are then used in the digital correction process. Another dynamic technique that can potentially be used to correct the DAC capacitor-array degradation is the split ADC architecture with a fully digital background calibration and correction algorithm, as presented in [Mcn11]. The authors use the split ADC architecture to obtain the difference between the two independent outputs of the half-sized ADCs. This difference is used in the background calibration algorithm to extract the correction parameters for correcting the errors induced by the DAC capacitor array. Furthermore, in order to avoid the possible complexities and associated overheads in terms of area and speed, radical designs that tolerate degradation better over a range of input stress voltages and working stress temperatures should be investigated to maintain the DAC capacitor array values during operational life.

#### 6.7.3 DEPENDABILITY ENHANCEMENT STRATEGY

Based on the above-mentioned approaches to control the buffer and comparator offset voltage and the DAC capacitor array values, the best strategy seems to be to use auto-zeroing as the dynamic offset cancellation technique and the background digital calibration technique, based on bitwise correlation (BWC), to correct the DAC capacitor array values during the operational life of the SAR ADC. The scaling of CMOS device dimensions offers clear advantages for digital circuitry in terms of density, speed, and integration. Therefore, it becomes advantageous to push the calibration further into the digital domain as described above. Nowadays most A/D converters with offset cancellation make use of auto-zeroed comparators.
However, the actual implementation of the auto-zeroing technique brings complications as well [Ana13]. The recent, more sophisticated auto-zeroing techniques employ multiple nulling loops, differential signal paths, multiple stage amplifiers, and Miller multiplied storage capacitors [Ana13]. Internal voltages are carefully controlled to prevent the saturation of nulling circuitry and hence to prevent the long overload times due to the complicated settling behaviour of these nulling loops. A careful design and layout is required to reduce the digital clock noise and aliasing effects. The size of the on-chip storage capacitors is also limited by the cost-effective die size. This requires small storage capacitors with efficient switch design and layout to reduce the offset errors introduced due to charge injection effects. Switch leakage must also be minimized to maintain the circuit accuracy, especially at high temperatures. The above technique to use auto-zeroing and background digital calibration has the effect that the static and dynamic performance parameters can stay within their specifications boundaries as a result of buffer, comparator and DAC capacitor degradations. Therefore, by using auto-zeroing and background digital calibration the reliability, being the probability as a function of time that the system will be functioning correctly at that time, has increased. However, in order to increase the availability, being the probability as a function of time that the system will be available for its service, it will still require additional actions to be taken. Of course, the system is available for its correct service and will be maintained automatically by employing the above mentioned strategies for a certain amount of operational life. However, depending on the nature of the application this may require further considerations. 
For example, mission-critical applications working in harsh environments that need to be available for very long operational times require redundant systems or sub-systems. This redundancy is essential in order to avoid permanent failures of a system because the availability will be affected in case of permanent failures in the system. Therefore, in order to avoid permanent failures and keep the system available for its correct service one requires to have fault-tolerant strategies based on redundant systems or sub-systems as discussed in the previous chapters. This will certainly increase the availability of the system. Similarly, the *maintainability*, being the probability as a function of time that the system can be repaired when it fails to function correctly, will also be enhanced by incorporating the above mentioned strategies. However, in order to decrease the repairing time it is required to have some monitoring mechanism; a monitoring mechanism that will monitor the performance of the system. This will be further used to take the necessary precautionary actions in advance in order to decrease the repairing time and hence a potential way to enhance the maintainability of the system. Figure 6.15: Proposed hardware architecture to enhance the dependability of the SAR ADC. Furthermore, the requirement to have monitoring mechanisms to take the precautionary actions in advance in order to decrease the repairing time and the redundancy requirements for improving availability of the system suggests to use a dependable hardware architecture as discussed in the next section. #### 6.7.3.1 DEPENDABLE HARDWARE ARCHITECTURE The dependability enhancement possibilities, discussed above, suggest that depending on the nature of the application one further requires to have redundancy and monitoring mechanisms in the system. The important thing to note here is that it all depends on the nature of the application and the possible working environment of the system. 
If, for a particular application, the degradation of the performance parameters is not a major issue and they remain within their specifications for the required period of operational life, then no corrections are required. However, if there are very strict requirements because of the criticality of the mission or application, then one has no choice except to include redundancy and monitoring capabilities. Therefore, in worst-case conditions, it is suggested to have a dependable hardware architecture as discussed in section 3.9.3. Figure 6.15 is a version of this hardware architecture modified to fit the SAR ADC architecture. It consists of redundant buffer and comparator parts. If one assumes that the DAC capacitor array is less prone to degradation than the buffer and comparator, then a single DAC capacitor array will be sufficient for a specific duration of operational life, as shown in Figure 6.15. The monitoring of the performance parameters, especially the SAR ADC offset voltage and INLE value, can be carried out via the traditional sine-wave histogram method as mentioned above. To facilitate this, there is a dedicated on-chip sine-wave generating circuit and digital circuitry along with an associated memory to carry out all the histogram-based performance analysis. Similarly, to facilitate the

As described above, on the one hand auto-zeroing can introduce implementation complexities, and on the other hand one has to have a monitoring mechanism for the buffer and comparator offset voltages. Therefore, instead of using auto-zeroing one can use digital offset tuning methods. With a monitored value of the buffer and comparator offset voltage already available, an equivalent digital tuning value can be calculated in advance and applied instantly to bring the offset voltage back within its specification boundaries.
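The calculation of a digital tuning value from a monitored offset can be sketched as follows. This is a minimal illustration only: the 9 µV step size is taken from the example quoted for [Wan11], while the function names, the signed code range (`max_code`) and the numbers in the usage lines are hypothetical and are not taken from the SAR ADC design.

```python
# Minimal sketch: map a monitored offset voltage to a signed digital tuning code.
# Assumptions (not from the thesis): a linear tuning knob, a signed code range of
# +/- max_code, and one compensation step of step_v volts per code.

def tuning_code(measured_offset_v, step_v=9e-6, max_code=127):
    """Return the signed code whose compensation best cancels the measured offset."""
    code = round(-measured_offset_v / step_v)
    # Clamp to the (hypothetical) range of the tuning knob.
    return max(-max_code, min(max_code, code))


def residual_offset(measured_offset_v, code, step_v=9e-6):
    """Offset that remains after applying the tuning code."""
    return measured_offset_v + code * step_v


def within_spec(offset_v, spec_limit_v):
    """True if the offset magnitude is inside the specification boundary."""
    return abs(offset_v) <= spec_limit_v


# Usage: a monitored offset of 100 uV and a hypothetical 5 uV specification limit.
code = tuning_code(100e-6)
left = residual_offset(100e-6, code)
print(code, within_spec(left, 5e-6))
```

If the required code saturates at the end of the range, tuning alone cannot restore the specification, which is the point at which replacement by a redundant block would be considered.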
Therefore, the proposed hardware architecture in Figure 6.15 can be used to enhance the dependability of the SAR ADC in a similar way as the proposed hardware platform discussed in Chapter 3. In special cases, where the option of human intervention is not possible, like space missions, the number of redundant buffers and comparators can also be increased depending on the total lifetime requirements. In normal case, without further increasing number of redundant sub-blocks, a rough estimate shows that the total area overhead is probably higher than 100%. #### 6.8 CONCLUSIONS In this chapter, system-level behavioural models have been used in order to analyse the static and dynamic performance parameter degradations of the charge-redistribution SAR ADC as a result of aging degradations in the buffer, comparator and DAC capacitor array. Degradation effects of the buffer, comparator and DAC capacitor array have been modelled and incorporated in the system-level model of the SAR ADC. These models have been further used in a flexible performance-analysis simulation setup designed in the LabVIEW environment where different input parameters including degradation values of different building blocks can be selected by the user. Simulation results of this simulation setup show that the ADC performance parameters like the SINAD, THD and ENOB are heavily affected. Other performance parameters like the offset, gain, DNLE, and INLE are minimally affected as a result of degradation in the buffer and comparator offset voltages. On the other hand, all of the static (except offset voltage) and all of the dynamic performance parameters of the SAR ADC are severely affected due to DAC capacitor-array degradations. Based on these results two performance parameters, the output offset voltage and INLE, have been selected to be the best indicators for the degradation of the SAR ADC. 
Different possibilities to mitigate the causes of these degradations in the output offset voltage and INLE values have been discussed and used in the proposed dependability enhancement strategies for the SAR ADC. Depending on the worst-case application requirements, a dependability-enhancing hardware architecture has been proposed that utilizes hardware redundancy as well as performance monitoring mechanisms.

#### 6.9 REFERENCES

[Ana13] Auto-Zero Amplifiers White Paper, Analog Devices, Dec. 2013 (http://www.analog.com/static/imported-files/tech articles/197774722Autozero.whitepaper.doc)

[Atx13] http://www.atx7006.com/articles/adc histogram test (Dec. 2013)

[Aza00] F. Azais, S. Bernard, Y. Bertrand, and M. Renovell, "Towards an ADC BIST scheme using the histogram test technique," in IEEE Proc. European Test Workshop, pp. 53-58, 2000.

[Aza01] F. Azais, S. Bernard, Y. Bertrand, and M. Renovell, "Optimizing sinusoidal histogram test for low cost ADC BIST," in Journal of Electron. Test.: Theory Appl., Vol. 17, No. 3/4, pp. 255–266, 2001.

[Bao09] Y. Baoguang, F. Qingguo, J.B. Bernstein, Q. Jin, and D. Jun, "Reliability Simulation and Circuit-Failure Analysis in Analog and Mixed-Signal Applications," in IEEE Trans. on Device and Materials Reliability, Vol. 9, No. 3, pp. 339-347, 2009.

[Chi07] H. Chi-Chao, et al., "An Innovative Understanding of Metal-Insulator-Metal (MIM)-Capacitor Degradation Under Constant-Current Stress," in IEEE Trans. on Device and Materials Reliability, Vol. 7, No. 3, pp. 462-467, 2007.

[Gra06] N. Gray, "ABCs of ADCs," by National Semiconductor Corporation, Rev 3, 2006.

[Hae10] S. Haenzsche, S. Henker, and R. Schuffny, "Modelling of capacitor mismatch and non-linearity effects in charge redistribution SAR ADCs," in Proc. IEEE Int. Conf. Mixed Design of Integrated Circuits and Systems (MIXDES), pp. 300-305, 2010.

[Hot09] M. K. Hota, et al., "Reliability behavior of TaAlOx Metal-Insulator-Metal capacitors," in IEEE Int. Symp.
Physical and Failure Analysis of Integrated Circuits, pp. 803-806, 2009.

[Hsi08] T. Hsin-Wen, L. Bin-Da, and C. Soon-Jyh, "A Histogram-Based Testing Method for Estimating A/D Converter Performance," in IEEE Trans. Instrumentation and Measurement, Vol. 57, No. 2, pp. 420-427, 2008.

[Kha13] M.A. Khan, and H.G. Kerkhoff, "Analysing Degradation Effects in Charge-Redistribution SAR ADCs," in IEEE Int. Symp. Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), pp. 65-70, 2013.

[Kha14] M.A. Khan, and H.G. Kerkhoff, "Studying DAC Capacitor-Array Degradation in Charge-Redistribution SAR ADCs," in IEEE Int. Symp. Design and Diagnostics of Electronic Circuits & Systems (DDECS), pp. 15-20, 2014.

[Kri10] S. K. Krishnappa, H. Singh, and H. Mahmoodi, "Incorporating Effects of Process, Voltage, and Temperature Variation in BTI Model for Circuit Design," in IEEE Latin American Symposium on Circuits and Systems, pp. 236-239, 2010.

[Kug00] T. Kugelstadt, "The operation of the SAR-ADC based on charge redistribution," in Texas Instruments Analog Applications Journal, pp. 10-11, 2000.

[Lee04] D. Lee, K. Yoo, K. Kim, G. Han, and S. Kang, "Code-width testing-based compact ADC BIST circuit," in IEEE Trans. Circuits Syst. II, Vol. 51, No. 11, pp. 603–606, 2004.

[Liu12] W. Liu, P. Huang, and Y. Chiu, "A 12-bit 50-MS/s 3.3-mW SAR ADC with background digital calibration," in IEEE Int. Conf. Custom Integrated Circuits Conference (CICC), pp. 1-4, 2012.

[Mar11] E. Maricau, et al., "A compact NBTI model for accurate analog integrated circuit reliability simulation," in Proceedings of the European Solid-State Device Research Conference (ESSDERC), pp. 147-150, 2011.

[Mcn11] J.A. McNeill, K.Y. Chan, M.C.W. Coln, C.L. David, and C. Brenneman, "All-Digital Background Calibration of a Successive Approximation ADC Using the "Split ADC" Architecture," in IEEE Trans. Circuits and Systems I: Regular Papers, Vol. 58, No. 10, pp. 2355-2365, 2011.

[Pei10] X. Pei, and P.
Wang, "Design and modeling of a 12-bit SAR ADC IP with non-lumped capacitor array," in IEEE Int. Conf. on Future Computer and Communication (ICFCC), Vol. 3, pp. 392-395, 2010.

[Sed11] N. Sedghi, W. Davey, I.Z. Mitrovic, and S. Hall, "Reliability studies on $Ta_2O_5$ high-$\kappa$ dielectric metal-insulator-metal capacitors prepared by wet anodization," in Journal of Vacuum Science & Technology B: Microelectronics and Nanometer Structures, Vol. 29, No. 1, pp. 01AB10-01AB10-8, 2011.

[Wan11] J. Wan and H.G. Kerkhoff, "Boosted gain programmable OpAmp with embedded gain monitor for dependable SoCs," in IEEE Int. SoC Design Conference (ISOCC), pp. 294-297, 2011.

[Wu10] S.H. Wu, C.K. Deng, T.H. Hou, and B.S. Chiou, "Stability and Degradation Mechanism of La2O3 Metal-Insulator-Metal Capacitors under Constant Voltage Stress," in ECS (Electrochemical Society) Meeting Abstracts, Abstract No. 33, April 2010.

[Wu13] R. Wu, J.H. Huijsing, and K.A.A. Makinwa, "Precision Instrumentation Amplifier and Read-Out Integrated Circuits," Springer Publishing Press, ISBN 978-1-4614-3730-7, 2013.

[Zha08] B. Zhang, and M. Orshansky, "Modeling of NBTI-Induced PMOS Degradation under Arbitrary Dynamic Temperature Variation," in Int. Symp. on Quality Electronic Design (ISQED), pp. 774-779, 2008.

# CONCLUSIONS, CONTRIBUTIONS AND FUTURE WORK

In this chapter, the final conclusions and recommended future work of our research are presented. First, in section 7.1, all research work presented in this thesis is summarized. In sections 7.2 and 7.3, the conclusions and main contributions are stated. Possible limitations of the current research work are discussed in section 7.4. Finally, in section 7.5, some possible directions for future work, as a continuation of the research presented in this thesis, are presented.

#### 7.1 SUMMARY OF THE RESEARCH WORK

Electronic systems have become an essential part of our contemporary society.
They are frequently used in consumer as well as safety-critical systems. These systems usually require interactions with the environment, which is analog in nature. Therefore, being the interface between the real world and the digital world, analog and mixed-signal (AMS) front ends are an essential part of these electronic systems. Furthermore, in designing these electronic systems, the cost, area, and power consumed are important issues. To cope with these issues, the semiconductor industry continues to scale transistor devices to smaller nanometer CMOS technologies. However, to maintain effective performance scaling, the oxide electrical fields and current densities are increasing continuously. On the one hand, technology scaling has improved electronic systems in terms of performance, energy consumption and die cost. On the other hand, it has introduced new spatial, temporal and dynamic variations. Failure mechanisms like Bias Temperature Instability (BTI), Hot Carrier Injection (HCI), Electro-migration (EM), and Time Dependent Dielectric Breakdown (TDDB) can have a significant impact on the lifetime of electronic systems. Therefore, advanced CMOS technology nodes have now reached dimensions where it is very challenging to guarantee system operation over the intended product lifetime, and hence system dependability. This means that, in order to have a dependable interface between the real world and the digital world in advanced CMOS technologies, one requires dependable analog and mixed-signal (AMS) front ends. This dependable interface differs from a reliable interface (where only reliability has been considered): a dependable interface has to be reliable as well as available and maintainable. In this thesis we assume three attributes of dependability, namely reliability, availability and maintainability, as the desired and essential interface properties.
In other words, the AMS interface has to be:

1) Functioning correctly at any time under a given set of operating conditions (Reliability)
2) Available to correctly deliver its services at any time (Availability)
3) Capable of being repaired at any time if it fails to deliver correct services (Maintainability)

Therefore, making the AMS interface dependable requires addressing the problem from three different angles. First, strategies have to be incorporated to make the AMS interface reliable so that it is always functioning correctly in its intended environment. Second, strategies have to be incorporated to make it available at any time with (ideally) zero downtime. This means, as a third requirement, that it should (ideally) be repaired immediately if it fails to provide its correct service. A number of design-stage simulation techniques are already available to address the reliability issues in devices and associated small circuits. However, more complex systems, especially AMS systems, require special attention, because these conventional transistor-based simulators are extremely slow in analyzing degradation effects. A solution to this problem, presented in this research work, is to use system-level behavioral models along with degradation information. This degradation information can be extracted from actual lifetime tests (measurements). Alternatively, a complex system can be divided into simpler sub-systems (where possible) and the degradation information of each sub-system can be extracted by using conventional reliability simulators. This degradation information can then be used in system-level behavioral models to simulate the degradation effects in the complete (complex) system (Chapters 3 and 6). A number of conventional system-level simulation tools like VHDL-AMS or LabVIEW can be used for this purpose.
The degradation information of a set of important performance parameters, specific to an application, can be used either to build robust designs or to extract the critical performance parameters with regard to degradation effects. Making a design robust for a number of performance parameters is considered complex, sensitive to small variations/noise and extremely time consuming, especially for AMS systems. The other possibility is to use this information for selecting critical performance parameters in terms of degradation effects. These critical performance parameters can be regularly monitored during the operational life of a complex electronic system to enable the critical decisions (digital tuning or replacement) that enhance its reliability. These decisions can be acted upon by providing tuning/adjustment mechanisms, driven by external digital stimuli, in these complex electronic systems. Examples of such systems include digitally-assisted AMS systems, where external digital signals are used to realize, control, improve, and change the circuit functionalities in the analog domain. Implementing this concept requires regular observation and control mechanisms. This method is introduced and implemented (simulated) in Chapter 3. Once the reliability issue is tackled, the next point is to make the system available for correct operation at any time. This concept is rarely addressed in the literature for AMS interfaces. However, it is an important property of AMS interfaces and is considered equally in this research. If neglected, availability can become a bottleneck and may lead to complete AMS interface failure. One approach, which has been presented in this research work, is to introduce spare AMS interfaces (units) in parallel to the original one. In our case, two spare units have been introduced. This points to the fact that hardware redundancy is essential for fault-tolerant behaviour.
A switching mechanism can be introduced to activate the redundant hardware when needed. In order to reduce power requirements, the redundant parts can be completely isolated from the power supply and can be activated only if they are being replaced with the faulty units. To reduce the switching time, or downtime, possible future faults can be predicted in advance by regularly analysing their performance as described above. This will certainly reduce the repairing time or the maintainability of the system will increase. This means that the goal to implement a dependable AMS interface can be accomplished by 1) regularly monitoring the critical performance parameters with regard to degradation mechanisms, 2) introducing digital assistance in terms of external digital tuning/adjustment of critical performance parameters and 3) incorporating switching-based replacement of faulty units with spare units. A major problem in this proposed scheme is how to reduce the area overhead introduced by redundant units and the associated complexity, power consumption, and speed degradation. This has been solved by optimizing the dependability requirements along with other issues (e.g. power, speed etc.). The idea is to design a complete dependable IP library, for each individual IP, having different flavours of the same functional IPs that have different reliability, availability, maintainability, area, power, and speed values. Depending on the requirements set by the user or application, the best set of IPs can be selected to make a compromise among dependability, associated overheads and performance constraints. This is a generic concept and can be used for other electronic systems as well. Furthermore, a single spare unit instead of two spare units can be used as a duplicate system based on a workload sharing mechanism with built-in monitoring and control mechanism for AMS systems (Chapter 3, section 3.9.3). 
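The library-based selection described above can be sketched as an exhaustive search over one flavour per required function. This is a minimal sketch under stated assumptions: the `IPFlavour` fields, the example library, the budgets and the use of the product of reliabilities (a series-system assumption) as the only dependability score are all hypothetical; the thesis library also carries availability, maintainability and speed values, which are omitted here for brevity.

```python
# Minimal sketch of selecting IP flavours from a dependable IP library.
# All names and numbers are hypothetical illustrations.
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class IPFlavour:
    name: str
    reliability: float  # probability of correct function over the mission time
    area: float         # mm^2
    power: float        # mW


def select_flavours(library, max_area, max_power):
    """Pick one flavour per function, maximizing the product of reliabilities
    (series-system assumption) within the area and power budgets."""
    best, best_rel = None, -1.0
    for combo in product(*library.values()):
        if sum(ip.area for ip in combo) > max_area:
            continue
        if sum(ip.power for ip in combo) > max_power:
            continue
        rel = 1.0
        for ip in combo:
            rel *= ip.reliability
        if rel > best_rel:
            best, best_rel = combo, rel
    return best, best_rel  # best is None if no combination fits the budgets


library = {
    "opamp": [IPFlavour("opamp_basic", 0.90, 1.0, 2.0),
              IPFlavour("opamp_redundant", 0.99, 3.0, 2.5)],
    "adc": [IPFlavour("adc_basic", 0.92, 2.0, 5.0),
            IPFlavour("adc_redundant", 0.98, 4.5, 6.0)],
}
chosen, system_reliability = select_flavours(library, max_area=6.0, max_power=9.0)
print([ip.name for ip in chosen], system_reliability)
```

With these example budgets the fully redundant combination exceeds the area limit, so the search settles on a mixed selection, which illustrates the compromise between dependability and overhead. An exhaustive search is only practical for small libraries; a larger library would need a proper optimization method.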
In order to solve dependability issues introduced as a result of fabrication-related process-induced initial-value dependent degradation effects, a new workflow has been introduced. Usually, it is assumed that the performance degradation (aging) behavior of similar systems having different initial performance values as a result of process variations is similar. However, it has been observed that the degradation behavior of similar performance parameters having different initial values, as a result of process variations, is different. In some cases degradation will be faster and in some other cases it will be slower. It can even become worse in the case the degradation direction is changing during the product life time and can be random. This requires a different monitoring mechanism based on a database of system specifications and regularly monitored logged values of these critical performance parameters. The new monitoring mechanisms observe the performance of critical performance parameters at regular intervals of times and use this information to decide intelligently at what time it has to take digital tuning or replacement actions (Chapter 4). Furthermore, in order to reduce the sub-system loading and complexity problems while monitoring system performance parameters, an indirect novel technique is presented. In this technique, a set of degraded values of critical performance parameters along with corresponding stress conditions is extracted during design-stage simulations. By establishing a numerical relation between these degraded values of the performance parameter and the associated stress conditions, a mechanism can be employed to extract performance degradations as a function of time, based on the regular measurements of stressors. These working-stress voltages and working-stress temperatures are regularly monitored and corresponding performance degradations can be estimated based on the design-stage calculations (Chapter 4). 
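The indirect technique can be illustrated with a small lookup-based sketch. It assumes, purely for illustration, that the design-stage simulations yield a degradation rate per hour for each (voltage, temperature) stress condition, that the nearest simulated condition is used for each logged interval, and that degradation accumulates linearly over stress time. Real aging mechanisms are generally not linear in time, and the table values, units and function names below are hypothetical.

```python
# Minimal sketch of indirect degradation estimation from logged stressors.
# rate_table maps a simulated (voltage in V, temperature in deg C) condition to a
# hypothetical degradation rate of one performance parameter (e.g. uV per hour).

def nearest(value, grid):
    """Return the grid point closest to the measured value."""
    return min(grid, key=lambda g: abs(g - value))


def estimate_degradation(stress_log, rate_table):
    """Accumulate degradation over logged (voltage, temperature, hours) intervals."""
    volts = sorted({v for v, _ in rate_table})
    temps = sorted({t for _, t in rate_table})
    total = 0.0
    for voltage, temperature, hours in stress_log:
        key = (nearest(voltage, volts), nearest(temperature, temps))
        total += rate_table[key] * hours  # linear-in-time assumption
    return total


rate_table = {
    (1.0, 25): 0.001, (1.0, 125): 0.004,
    (1.2, 25): 0.002, (1.2, 125): 0.010,
}
stress_log = [(1.02, 30, 1000), (1.18, 120, 500)]  # measured V, deg C, hours
print(estimate_degradation(stress_log, rate_table))
```

The design choice here is that only the stressors are measured at runtime, so the monitored sub-system is not loaded by a direct performance measurement; the accuracy then depends entirely on how well the design-stage table covers the actual operating conditions.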
Despite all these improvements to make AMS systems more dependable, one of the fundamental issues, that is mostly ignored in reliability calculations, is to separate short-term and long-term dependability issues. Short-term issues are caused by short-term or temporary changes in the working-stress voltages and working-stress temperatures. Therefore, it is important to differentiate and separate these short-term variations and long-term degradation effects in order to efficiently enhance the dependability of AMS systems (Chapter 5). The short-term variations can cause temporary changes in the performance of an AMS interface and can be reduced or removed by bringing the operating stress conditions back to their normal specifications. On the other hand, the long-term dependability issues in the AMS interface, being produced as a result of the cumulative nature of degradation/aging effects, can be resolved by having similar digitally-assisted capabilities of external digital tuning of existing units or replacement of faulty units. As an example of a somewhat complex system (or IP) and being an important part of most AMS interfaces, a charge-redistribution successive approximation register (SAR) ADC, has been considered to investigate the reliability issues in its static and dynamic performance parameters based on the proposed system-level behavioral-model degradation simulations of its sub-blocks. This information is then further used to discuss the potential solutions in order to enhance the dependability of the SAR ADC (Chapter 6). #### 7.2 Answers to Research Questions Based on the summary presented above, the research questions mentioned in Chapter 1 can be answered as: - What type of hardware architecture can be used to address the technology-scaling related temporal-degradation issues in AMS systems during their operational life? 
A hardware architecture based on spare AMS interfaces connected via electronic switches and having built-in digital repairing capabilities along with monitoring mechanisms can be used to enhance the dependability of analog and mixed-signal front ends during their operational life (Chapter 3).

- How can optimization be achieved among different dependability requirements and other issues like area, power, speed etc. in AMS systems?

A dependable library-based optimization technique for selecting the best combination of system modules or sub-modules can be used to establish a compromise between the required dependability attributes and other issues like area, power, speed etc. for AMS systems (Chapter 3).

- What type of improvements will be necessary in the hardware architecture to address the initial-value dependent degradation issues in AMS systems?

An improved dependable hardware architecture with a runtime reliability estimation mechanism can be used to address the initial-value dependent degradation issues, and this information can subsequently be used to intelligently take the necessary repair actions in order to improve the dependability of AMS front ends (Chapter 4).

- What type of methodologies can be used in order to indirectly estimate the performance of AMS systems during their operational life?

A novel technique can be used in which design-stage simulations extract a set of degradation values of critical performance parameters along with the corresponding stress conditions. This information can then be used to indirectly estimate the performance of AMS systems during their operational lifetime (Chapter 4).

- What (additional) actions will be required to distinguish between time-dependent variations and dynamic variations (i.e. long-term and short-term variations) in order to enhance the dependability of AMS systems?
A modified system-level technique based on the regular monitoring of working stress voltages and temperatures can be used to address the time-dependent and dynamic variations (i.e. long-term and short-term variations respectively) separately during the operational life. This differentiation can be further used to better manage the dependability of AMS systems during short-term and long-term variations (Chapter 5).

- What type of alternate methods, as compared to conventional device-level simulations, can be used to analyze/investigate time-dependent variations/degradations in complex AMS systems (e.g. analog-to-digital converters)?

Behavioral models of a system and its sub-systems, combined with the corresponding degradation information of the sub-systems, can be used to investigate the degradation issues in the complete system. This method can be used to analyze/investigate the degradation effects in complex AMS systems (e.g. ADCs), which is usually extremely time consuming with conventional circuit-level simulations (Chapter 6).

#### 7.3 CONCLUSIONS AND MAIN CONTRIBUTIONS OF OUR RESEARCH WORK

In this section the conclusions and the main contributions of the research in this thesis are explained under several sub-headings.

#### 7.3.1 THE DEPENDABLE HARDWARE PLATFORM

A hardware platform based on digitally-assisted redundant IPs for a general AMS front end has been presented. This hardware platform is different from the conventional TMR system in the sense that it has no voter. Instead, it has a built-in performance monitoring mechanism, and the spare IPs are not operational because they are powered off. Only one IP of a particular kind (e.g. OpAmp) is active at any time. The performance of the active IP is regularly monitored and controlled via built-in monitoring and digital control mechanisms respectively.
In case of performance deviations from the allowed design specifications, the digital tuning/adjustment capabilities are first used to bring the performance back to its normal values. If the digital tuning/adjustment options are insufficient, the complete IP is replaced with a spare IP via switches. Therefore, it can be concluded that monitoring performance and taking the necessary repair actions improves the dependability of AMS front ends (Chapter 3) [Kha11a].

#### 7.3.2 THE LIBRARY OF DEPENDABLE IPS

A generic concept based on a library of dependable IPs has been presented to optimize the dependability, area, speed and power requirements set by the user or application. The idea is to design a number of flavors of the same functional IP that have different values for reliability, maintainability, availability, area, speed, and power. The best combination of IPs is then selected to meet the dependability, area, speed and power requirements set by the user or application for the required system. In conclusion, a compromise can be established between the available resources (library) and the desired system requirements (Chapter 3) [Kha11b].

#### 7.3.3 THE DEPENDABLE WORKLOAD-SHARING DUPLICATION SYSTEM

To overcome a major area overhead, a duplication system having built-in monitoring and digital control mechanisms is proposed in order to enhance the dependability of AMS front ends. The enhanced hardware platform proposed in this technique is based on duplicated IPs, with one IP of a particular kind (e.g. OpAmp) active at any time. This hardware platform is different from the above-mentioned "Dependable Hardware Platform". The idea is to use one IP at any time and, at regular intervals or at calculated points in time, to activate the other IP so that the first IP can be properly diagnosed and repaired for any performance degradations, leaving a perfectly functioning IP for correct operation.
After another regular interval of time, or after a calculated interval of time, the first IP is activated again for normal operation so that the second IP can be properly diagnosed and repaired for any potential performance degradations. In order to avoid failures in the currently active module, the duration of its normal operation can be selected intelligently such that, before a failure occurs (as a result of degradation), the other IP is activated for normal operation and the current IP undergoes the diagnosis and repair procedure. In this way the availability is increased to its maximum, limited only by the switching time. Therefore, it can be concluded that with this approach a more reliable, maintainable, and available AMS front end with less area overhead can be achieved (Chapter 3) [Kha11b].

#### 7.3.4 PROCESS-INDUCED INITIAL-VALUE DEPENDENT WORKFLOW

An important finding, which has been used to improve the dependability of the AMS front end during its operational life, is that the degradation rate depends on the initial value of the performance parameter produced by process-induced variations. This implies that the runtime performance degradation and the lifetime are hard to predict from design-stage reliability simulations alone, which necessitates the use of runtime reliability-estimation techniques. These techniques have been used in the proposed dependability workflow, where the design-stage specification boundaries together with the runtime reliability-estimation calculations are used to intelligently take the necessary repair actions in order to improve the dependability of AMS front ends. In conclusion, a workflow that estimates the runtime reliability based on runtime performance monitoring is essential to enhance the dependability of AMS front ends against process-induced initial-value dependent variations (Chapter 4) [Kha13c].
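The interplay between a specification boundary and a runtime estimate can be illustrated with a minimal Python sketch. It is not the implementation used in this thesis: the linear extrapolation between two measurements, the names `Measurement`, `remaining_time_h` and `next_action`, and all numerical values are assumptions chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Measurement:
    time_h: float  # operating time at which the parameter was measured (hours)
    value: float   # measured value of the critical performance parameter


def remaining_time_h(first: Measurement, second: Measurement, upper_limit: float) -> float:
    """Estimate the hours left until the parameter exceeds upper_limit.

    Assumption: the drift observed between the two measurements continues linearly.
    """
    if second.time_h <= first.time_h:
        raise ValueError("second measurement must be taken after the first")
    if second.value >= upper_limit:
        return 0.0  # already at or beyond the specification boundary
    rate = (second.value - first.value) / (second.time_h - first.time_h)
    if rate <= 0.0:
        return math.inf  # no drift towards the limit was observed
    return (upper_limit - second.value) / rate


def next_action(remaining_h: float, monitoring_interval_h: float) -> str:
    """Request a repair if failure is expected before the next monitoring point."""
    return "repair" if remaining_h <= monitoring_interval_h else "continue"


# Two hypothetical dies with different process-induced initial offsets (mV)
# and therefore different degradation rates; the upper specification limit is 3.0 mV.
die_a = remaining_time_h(Measurement(0.0, 1.0), Measurement(1000.0, 1.5), upper_limit=3.0)
die_b = remaining_time_h(Measurement(0.0, 2.0), Measurement(1000.0, 2.8), upper_limit=3.0)
print(die_a, next_action(die_a, 500.0))
print(die_b, next_action(die_b, 500.0))
```

With these illustrative numbers the second die, which starts closer to the limit and drifts faster, is flagged for repair while the first die continues normal operation. The same design-stage limit thus leads to different runtime decisions per die.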
#### 7.3.5 DIRECT RUNTIME RELIABILITY-ESTIMATION TECHNIQUE

A runtime reliability-estimation technique based on estimates of the instantaneous-time-to-failure (*iTTF*) has been presented. The idea is to measure the degradation of a critical performance parameter at two different points in time and use this information to estimate the remaining time before the failure occurs (i.e. when the performance parameter moves beyond its allowed design specification). In this way, the time for which the system will still perform correctly can be approximated; in other words, the quantitative runtime reliability can be estimated. Therefore, it can be concluded that the quantitative runtime reliability of AMS front ends can be estimated based on the instantaneous-time-to-failure (*iTTF*) of a critical performance parameter that degrades over time (Chapter 4) [Kha13c].

#### 7.3.6 INDIRECT RUNTIME RELIABILITY-ESTIMATION TECHNIQUE

A novel idea to indirectly estimate the runtime reliability of electronic systems, based on runtime measurements of the working-stress voltage and working-stress temperature, has been presented. The idea is to use design-stage degradation/aging simulations of a critical performance parameter to obtain a set of values, corresponding to a range of actual working-stress voltages and working-stress temperatures of the system, and to store these values in a database. These values provide a relationship between the individual stress conditions and the corresponding degradation of the critical performance parameter. Later on, during the operational life, a continuous estimate of the working-stress voltage and working-stress temperature, along with the corresponding stress-time information, will provide an indirect way of estimating the corresponding degradation in the performance parameter.
This is rather different from the technique mentioned in the previous section, where the degradation in the performance parameter is measured directly. The performance degradation information at two different points in time is subsequently used to estimate the remaining time before the failure occurs (i.e. when the performance parameter moves beyond its designed specification) and hence provides a way to estimate the runtime reliability. In conclusion, an indirect runtime reliability-estimation technique can be used for AMS front ends in place of the direct technique mentioned in the above section, and it can further be used to enhance the dependability of AMS front ends during their operational life (Chapter 4) [Kha13b].

#### 7.3.7 DIFFERENTIATING BETWEEN SHORT-TERM AND LONG-TERM DEPENDABILITY ISSUES

In order to better manage the dependability improvement strategy, an important differentiation has been implemented and discussed. It has been observed that the working-stress voltage and working-stress temperature can change for a short or temporary duration, which can create short-term or temporary degradations in the performance of a system. Therefore, it is important to differentiate between short-term and long-term dependability issues. Long-term dependability issues result from aging effects and are cumulative in nature. Short-term dependability issues can be solved by monitoring the corresponding changes in the working-stress voltages and working-stress temperatures. Long-term dependability issues, however, can be solved by utilizing the same digital repairing strategies as previously discussed for "The Dependable Hardware Platform" (section 7.3.1). Therefore, it can be concluded that the separation of short-term and long-term dependability issues is required in order to further manage the overall dependability of AMS front ends (Chapter 5) [Kha13a].
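The separation described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the mechanism implemented in Chapter 5: the `Reading` structure, the nominal windows `NOMINAL_SUPPLY_V` and `NOMINAL_TEMP_C`, and the action strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    supply_v: float  # measured working-stress voltage (V)
    temp_c: float    # measured working-stress temperature (deg C)
    in_spec: bool    # True if the monitored performance parameter meets its specification


# Hypothetical nominal operating windows; real limits come from the design specification.
NOMINAL_SUPPLY_V = (1.14, 1.26)
NOMINAL_TEMP_C = (0.0, 85.0)


def _inside(value: float, window: tuple) -> bool:
    low, high = window
    return low <= value <= high


def classify(reading: Reading) -> str:
    """Attribute an out-of-spec reading to short-term or long-term causes."""
    if reading.in_spec:
        return "ok"
    stress_nominal = (_inside(reading.supply_v, NOMINAL_SUPPLY_V)
                      and _inside(reading.temp_c, NOMINAL_TEMP_C))
    # Out of spec under nominal stress points to cumulative aging (long-term);
    # out of spec under abnormal stress is treated as a temporary effect (short-term).
    return "long-term" if stress_nominal else "short-term"


ACTIONS = {
    "ok": "keep monitoring",
    "short-term": "restore nominal supply voltage and temperature",
    "long-term": "apply digital tuning, then switch to a spare IP if tuning is insufficient",
}

for reading in (Reading(1.20, 40.0, True), Reading(1.05, 40.0, False), Reading(1.20, 40.0, False)):
    label = classify(reading)
    print(label, "->", ACTIONS[label])
```

In this sketch a deviation that persists after the stress conditions have returned to their nominal windows would be reclassified as long-term at the next monitoring point, so aging that is temporarily masked by abnormal stress is still handled by the digital repair strategy.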
#### 7.3.8 BEHAVIORAL MODEL-BASED DEGRADATION ANALYSIS SYSTEM

Behavioral models have frequently been used to analyze the performance of electronic systems, especially because of their higher simulation speed. In this research, however, we have presented the idea of using behavioral models to simulate the degradation effects in electronic systems. Simulating degradation/aging effects in complex systems with transistor-based simulators is usually considered complex and very time consuming. Therefore, the presented idea is to subdivide the complex or bigger system into smaller sub-systems and to use the degradation information of each sub-system in its respective behavioral model to simulate the degradation effects in the whole system. The degradation effects of the smaller sub-systems can either be extracted from transistor-based simulators or be measured experimentally from accelerated life tests (which is also very time consuming and expensive). An alternative, less accurate way could be to use an expected range of degraded values. A number of system-level simulation tools like VHDL-AMS and LabVIEW have been used to implement the presented idea (Chapters 3 and 6). In conclusion, behavioral models can also be used to study the degradation effects in complex systems, and such simulations are much faster than those based on transistor-based simulators [Kha13d, Kha14].

#### 7.3.9 A FLEXIBLE DEGRADATION-ANALYSIS SYSTEM FOR SAR ADCS

Performance degradation analysis of ADCs using transistor-based simulators is considered to be very time consuming and often inefficient. To resolve this issue for a SAR ADC, a flexible degradation-analysis system based on system-level behavioral models (as discussed in the above section) has been implemented in the LabVIEW 7 environment. The analysis system is capable of simulating the degradation effects in the static and dynamic performance parameters of a SAR ADC including offset, gain, DNLE, INLE, SINAD, THD, and ENOB.
In this research work the degradation effects introduced by the buffer, comparator and DAC capacitor array have been considered. However, this concept can be extended to include the degradation effects of other sub-systems of the SAR ADC, or to other ADC architectures or mixed-signal systems. It can be concluded that the behavioral-model based degradation-analysis system can be used to study the degradation effects in the static and dynamic performance parameters of a SAR ADC, which are further used in proposing the dependability enhancement strategies discussed in Chapter 6 [Kha13d, Kha14].

#### 7.4 POSSIBLE LIMITATIONS OF THIS RESEARCH WORK

In this section the possible limitations of the techniques presented in this research work are briefly discussed.

- The dependable hardware platform presented in section 7.3.1 could face implementation problems. The monitoring and controlling part will be quite a challenge for AMS designers. The switching part can introduce leakage, noise and bandwidth problems.
- Designing a library of dependable IPs (section 7.3.2) for a number of electronic circuits/IPs will be quite time consuming and tedious work. However, once completed, it will become an asset for the electronics industry.
- In the dependable workload-sharing duplication system presented in section 7.3.3, a complete failure of one IP can cause availability problems because the system will then be unavailable whenever the remaining IP undergoes diagnosis and repair. Although the availability will be poor in this case, it will still be much better than that of the conventional TMR system, where a single working IP (module) is of no use because at least two IPs (modules) are required to decide on a correct operation. Furthermore, the switches (SW<sub>1</sub> and SW<sub>2</sub>) used to select the currently active path can become single points of failure if a fault-tolerant switching architecture is not considered.
- The indirect runtime reliability-estimation technique (section 7.3.6) mainly relies on the set of values extracted from the design-stage degradation simulations. Therefore, the accuracy of the presented technique will be limited by how accurately the simulation results match the results under real environmental conditions. Furthermore, the accuracy and speed of measuring the input-stress voltage and working-stress temperature during operational life (section 4.9) will also affect the overall accuracy of the presented technique.
- Usually, bringing the supply voltage back to normal operating values, as discussed in Chapter 5 (section 5.4), can be accomplished with the help of digitally-tunable power supplies. However, bringing the operating temperature back to normal operating values will be quite a challenge for system designers.
- The overall time needed for diagnosing and repairing the SAR ADC performance, as discussed in the proposed dependability enhancement strategy (section 6.7.3), could be considerable and can affect the overall dependability. However, using an alternate (spare) SAR ADC during the diagnosis and repair process (as discussed in section 3.9.3) will solve this issue.

#### 7.5 FUTURE WORK AND RECOMMENDATIONS

The research work presented in this thesis mainly focuses on system-level techniques that can be used to enhance the dependability of AMS SoCs; it mainly relies on system-level simulations. However, to further strengthen and validate the presented concepts, a strong industrial collaboration is required. It is important to implement the presented concepts in the respective aging-sensitive technology nodes. This has to be done gradually. One has to start with a simple implementation of analog and mixed-signal front ends, studying their performance degradation at system level using simulation tools and validating the results with actual accelerated life tests.
Then the actual implementation of the presented techniques has to be validated by highly accelerated life tests (HALT) and highly accelerated stress tests (HAST). Further improvements can be found in the digital domain, where new algorithms or software techniques have to be proposed to cope with degradation effects. These could potentially be implemented on the software side. The foundation of a new research direction can also be established based on the behavioral model-based degradation simulations for complex systems. As technology trends are reducing device dimensions and future electronic systems are becoming more compact and complex, this seems to be the evident choice to simulate and evaluate the dependability issues and take the proper countermeasures.

#### 7.6 REFERENCES

[Kha11a] M.A. Khan, and H.G. Kerkhoff, "A System-Level Platform for Dependability Enhancement and its Analysis for Mixed-Signal SoCs," in IEEE Int. Symp. Design and Diagnostics of Electronic Circuits & Systems (DDECS), pp. 17-22, 2011.

[Kha11b] M.A. Khan, and H.G. Kerkhoff, "SoC Mixed-Signal Dependability Enhancement: A Strategy from Design to End-of-Life," in IEEE Int. Symp. Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), pp. 374-381, 2011.

[Kha13a] M.A. Khan, and H.G. Kerkhoff, "Monitoring Operating Temperature and Supply Voltage in Achieving High System Dependability," in IEEE Int. Conf. Design & Technology of Integrated Systems (DTIS), pp. 112-116, 2013.

[Kha13b] M.A. Khan, and H.G. Kerkhoff, "An Indirect Technique for Estimating Reliability of Analog and Mixed-Signal Systems during Operational Life," in IEEE Int. Symp. Design and Diagnostics of Electronic Circuits & Systems (DDECS), pp. 159-164, 2013.

[Kha13c] M.A. Khan, and H.G. Kerkhoff, "The Essence of Reliability Estimation during Operational Life for Achieving High System Dependability," in IEEE Euromicro Conference on Digital System Design (DSD), pp. 575-581, 2013.

[Kha13d] M.A.
Khan, and H.G. Kerkhoff, "Analysing Degradation Effects in Charge-Redistribution SAR ADCs," in IEEE Int. Symp. Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), pp. 65-70, 2013.

[Kha14] M.A. Khan, and H.G. Kerkhoff, "Studying DAC Capacitor-Array Degradation in Charge-Redistribution SAR ADCs," in IEEE Int. Symp. Design and Diagnostics of Electronic Circuits & Systems (DDECS), pp. 15-20, 2014.

## **ABBREVIATIONS**

| Abbreviation | Meaning |
|---|---|
| AC | Alternating Current |
| ADC | Analogue-to-Digital Converter |
| ADE | Analogue Design Environment |
| ALT | Accelerated Life Testing |
| AMS | Analog and Mixed-Signal |
| ATB | Analogue Test Bus |
| Av | Availability |
| BERT | Berkeley Reliability Tools |
| BiCMOS | Bipolar Complementary Metal Oxide Semiconductor |
| BTI | Bias Temperature Instability |
| BWC | Bit-Wise Correlation |
| CHE | Channel Hot Electron |
| CMRR | Common Mode Rejection Ratio |
| CMOS | Complementary Metal Oxide Semiconductor |
| CMU | Central Measurement Unit |
| CR | Charge Redistribution |
| DAC | Digital-to-Analogue Converter |
| DAHC | Drain Avalanche Hot Carrier |
| DC | Direct Current |
| DFR | Design-For-Reliability |
| DNL | Differential Non-Linearity |
| DNLE | Differential Non-Linearity Error |
| DS | Duplicate System |
| DTB | Digital Test Bus |
| DUT | Devices-under-Test |
| EATB | External Analogue Test Bus |
| EM | Electro-Migration |
| ENOB | Effective Number of Bits |
| FFT | Fast Fourier Transform |
| FIT | Failure in Time |
| FMEA | Failure Mode and Effects Analysis |
| FPGA | Field Programmable Gate Array |
| FS | Full Scale |
| FSR | Full Scale Range |
| GUI | Graphical User Interface |
| HALT | Highly Accelerated Life Test |
| HAST | Highly Accelerated Stress Test |
| HCI | Hot Carrier Injection |
| IATB | Internal Analogue Test Bus |
| IC | Integrated Circuit |
| INL | Integral Non-Linearity |
| INLE | Integral Non-Linearity Error |
| IO | Input / Output Interface |
| IP | Intellectual Property |
| IPHP | Initial Proposed Hardware Platform |
| ITTF | Instantaneous Time To Failure |
| LabVIEW | Laboratory Virtual Instrument Engineering Workbench |
| LDD | Lightly Doped Drain |
| LER | Line-Edge Roughness |
| LMU | Local Measurement Unit |
| LP | Linear Programming |
| LSB | Least Significant Bit |
| MATLAB | MATRIX LABoratory |
| MDT | Mean Down Time |
| MEMS | Micro Electro-Mechanical Systems |
| MIM | Metal-Insulator-Metal |
| MOS | Metal Oxide Semiconductor |
| MOSFET | Metal Oxide Semiconductor Field Effect Transistor |
| MOSRA | Metal Oxide Semiconductor Reliability Analysis |
| MSB | Most Significant Bit |
| MTBF | Mean Time before Failures |
| MTTF | Mean Time to Failure |
| MTTR | Mean Time to Repair |
| NBTI | Negative Bias Temperature Instability |
| NMOS | N-Channel Metal Oxide Semiconductor |
| NMOSFET | N-Channel Metal Oxide Semiconductor Field Effect Transistor |
| NPHP | New Proposed Hardware Platform |
| NRIP | No Repairable Intellectual Properties |
| OpAmp | Operational Amplifier |
| OTF | Oxide Thickness Fluctuation |
| PBTI | Positive Bias Temperature Instability |
| PDF | Probability Density Function |
| PDK | Process Design Kit |
| PHP | Proposed Hardware Platform |
| PMOS | P-Channel Metal Oxide Semiconductor |
| PMOSFET | P-Channel Metal Oxide Semiconductor Field Effect Transistor |
| PSRR | Power Supply Rejection Ratio |
| PVT | Process-Voltage-Temperature |
| RBD | Reliability Block Diagram |
| RDF | Random Dopant Fluctuation |
| Rel | Reliability |
| RF | Radio-Frequency |
| SAR | Successive Approximation Register |
| SBD | Soft Break Down |
| SINAD | Signal-to-Noise-And-Distortion |
| SiO2 | Silicon Di-Oxide |
| SIP | Sub-Intellectual Property |
| SNR | Signal-to-Noise Ratio |
| SoC | System-on-Chip |
| SPICE | Simulation Program with Integrated Circuit Emphasis |
| SRAM | Static Random Access Memory |
| TBF | Time Before Failure |
| TDDB | Time-Dependent Dielectric Breakdown |
| THD | Total Harmonic Distortion |
| TMR | Triple Modular Redundancy |
| TS | Triplicate System |
| TSMC | Taiwan Semiconductor Manufacturing Company |
| TTR | Time To Repair |
| VCC | Voltage at a Common Connector |
| VCO | Voltage Controlled Oscillator |
| VDD | Voltage Drain Drain |
| VHDL | VHSIC (Very High Speed Integrated Circuit) Hardware Description Language |
| VLSI | Very Large Scale Integration |
| Vth | Threshold voltage |
| WIR | Wrapper Instruction Register |
| WRB | Wrapper Boundary Register |
| WRIP | With Repairable Intellectual Properties |

# LIST OF PUBLICATIONS

- [MAK:1]. M.A. Khan, and H.G. Kerkhoff, "Studying DAC Capacitor-Array Degradation in Charge-Redistribution SAR ADCs," in IEEE Int. Symp.
Design and Diagnostics of Electronic Circuits & Systems (DDECS), pp. 15-20, 2014.
- [MAK:2]. M.A. Khan, and H.G. Kerkhoff, "Analysing Degradation Effects in Charge-Redistribution SAR ADCs," in IEEE Int. Symp. Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), pp. 65-70, 2013.
- [MAK:3]. M.A. Khan, and H.G. Kerkhoff, "The Essence of Reliability Estimation during Operational Life for Achieving High System Dependability," in IEEE Euromicro Conference on Digital System Design (DSD), pp. 575-581, 2013.
- [MAK:4]. M.A. Khan, and H.G. Kerkhoff, "An Indirect Technique for Estimating Reliability of Analog and Mixed-Signal Systems during Operational Life," in IEEE Int. Symp. Design and Diagnostics of Electronic Circuits & Systems (DDECS), pp. 159-164, 2013.
- [MAK:5]. M.A. Khan, and H.G. Kerkhoff, "Monitoring Operating Temperature and Supply Voltage in Achieving High System Dependability," in IEEE Int. Conf. Design & Technology of Integrated Systems (DTIS), pp. 112-116, 2013.
- [MAK:6]. M.A. Khan, and H.G. Kerkhoff, "SoC Mixed-Signal Dependability Enhancement: A Strategy from Design to End-of-Life," in IEEE Int. Symp. Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), pp. 374-381, 2011.
- [MAK:7]. M.A. Khan, and H.G. Kerkhoff, "A System-Level Platform for Dependability Enhancement and its Analysis for Mixed-Signal SoCs," in IEEE Int. Symp. Design and Diagnostics of Electronic Circuits & Systems (DDECS), pp. 17-22, 2011.

# **BIOGRAPHY**

Muhammad Aamir Khan is a young professional who has multidisciplinary education and experience. He received his B.Sc. degree in Physics and Mathematics from the University of the Punjab (PU), Lahore, Pakistan in 1999. In 2001, he received his M.Sc. degree in Physics from the same university. From October 2002 till September 2004, he worked as a research student at the Pakistan Institute of Engineering and Applied Sciences (PIEAS), Islamabad, Pakistan and received his M.Sc.
degree in Systems Engineering from the same university in 2004. In September 2004, he joined a Research and Development Organization in Islamabad, Pakistan as a Scientific Officer and was promoted to Senior Scientific Officer in the same organization in December 2006. In September 2007, he went to Sweden for further studies and received his M.Sc. degree in System-on-Chip Design from the Royal Institute of Technology (KTH), Stockholm, Sweden in 2009. After finishing his studies in Sweden he started working towards his PhD degree at the University of Twente, Enschede, the Netherlands and received his PhD degree in November 2014. Throughout his academic and professional career, he has remained involved in multidisciplinary fields. His interests include Digital, Analog and Mixed-Signal Electronics, Embedded Hardware and Software, DSP, Electro-Mechanical and Control Systems, Image Processing, Modeling and Simulation, Data Analysis, Mathematics and Physics. During his PhD research, he published several papers in the field of electronic-system dependability modeling, simulation and analysis, and on different dependability enhancement and improvement strategies.

## **PROPOSITIONS**

accompanying the PhD dissertation

# On Improving Dependability of Analog and Mixed-Signal SoCs: A System-Level Approach

#### by Muhammad Aamir Khan

- 1. The dependability of analog and mixed-signal systems, being an important part of most critical systems especially in automotive, medical and military systems, has received little attention (this thesis).
- 2. It will be difficult to meet the dependability requirements during the operational life of systems using design-time tuning of the electronic systems only (this thesis).
- 3. Performance estimations during the operational life of a system are crucial for a dependable system (this thesis).
- 4. Gaining system dependability improvement has its price in terms of area, speed, power, and cost (this thesis).
- 5.
In order to improve the system dependability, the short-term and long-term effects should be addressed separately (this thesis).
- 6. Improvement in system dependability is dependent on the accuracy and dependability of its monitoring and tuning mechanism (this thesis).
- 7. We have to live with dependability issues.
- 8. The current trend of funding risk-free and guaranteed profitable research activities by governments and third parties will actually hinder the progress of innovation.
- 9. Science is just like a language that is used to understand some of the hidden laws of this universe; we need more languages to understand them all.
- 10. Solving lifetime dependability issues of electronic chips will require a good electronic doctor on chip.

These propositions are regarded as opposable and defendable, and have been approved as such by the promoter, Prof.dr.ir. G.J.M Smit.